EXAMINING DISEASE THROUGH MICROBIOME DATA ANALYSIS
Brett Van Tassel, M.S. Data Science
The objective of this project is to examine the relationship between the gut microbiomes of human subjects with different disease statuses by examining shifts in microbial diversity. Read analysis and data cleaning are recorded from beginning to end so that the unfiltered and unfettered data can be reanalyzed and processed. Here we strive to create a tool that works for well-curated data. Data is gathered from the database QIITA, and the read data and metadata are queried via the tool redbiom. The initial exploratory analysis involved an examination of metadata attributes. A heat map of pairwise associations between metadata attributes, computed with Cramér’s V, allows visual examination of correlations. Next, we train random forests based on metadata of interest. Due to the large quantity of attributes, many random forests are trained, and their respective significance values and Receiver Operating Characteristic (ROC) curves are generated. ROC curves are used to isolate optimal correlations. This process is built into a pipeline, ultimately allowing the efficient, automated analysis and assignment of disease susceptibility. Alpha and beta diversity metrics are generated and plotted for visual interpretation using QIIME2, a microbial analysis software platform. CLOUD, a tool for finding microbiome outliers, is used to identify markers of dysbiosis and contamination, and to measure rates of successful identification. When positive samples and their predicted diagnosis status were examined, CLOUD was found to identify positive diagnoses that Random Forests missed. As a data balancing technique, SMOTE was found to perform similarly to or slightly worse than random sampling.
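As an illustration of the association measure behind the heat map, the following is a minimal sketch of Cramér’s V for two categorical attributes. The function name `cramers_v` and the toy labels are hypothetical, and the sketch uses the plain (uncorrected) form of the statistic; the project may have used a bias-corrected variant or a library implementation.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length sequences of labels."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))   # observed count for each (x, y) cell
    row = Counter(xs)              # marginal counts of the first attribute
    col = Counter(ys)              # marginal counts of the second attribute
    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0                 # a constant attribute carries no association
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# Hypothetical metadata columns: diet category and disease status.
diet = ["vegan", "vegan", "omnivore", "omnivore"]
status = ["healthy", "healthy", "ibd", "ibd"]
print(cramers_v(diet, status))
```

Computing this value for every pair of attributes gives the symmetric matrix that a heat map would display, with 0 meaning no association and 1 meaning one attribute fully determines the other.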
EVALUATING HOW NHL PLAYER SHOT SELECTION IMPACTS EVEN-STRENGTH GOAL OUTPUT OVER THE COURSE OF A FULL SEASON
Elliott Barinberg, M.S. Data Science
Within this thesis work, the applications of data collection, machine learning, and data visualization were used on National Hockey League (NHL) shot data collected between the 2014-2015 season and the 2022-2023 season. Modeling sports data to better understand player evaluation has always been a goal of sports analytics. In the modern era of sports analytics the techniques used to quantify impacts on games have multiplied. However, when it comes to ice hockey all the most difficult challenges of sports data analysis present themselves in trying to understand the player impacts of such a continuously changing game-state. The methods developed and presented in this work serve to highlight those challenges and better explain a player’s impact on goal scoring for their team.
Throughout this work there are multiple kinds of modeling techniques used to try to best demonstrate a player’s impact on goal scoring as a factor of all the elements the player is capable of controlling. We try to understand which players have the best offensive process and impact on goal-scoring by caring about the merit of the offensive opportunities they create. It is important to note that these models are not intended to re-create the results seen in reality, although reality and true results are used to evaluate the outputs.
This process used data scraping to collect the data from the NHL public application programming interface (API). Data cleansing techniques were applied to the collected data, yielding custom data sets which were used for the corresponding models. Data transformation techniques were used to calculate additional factors from the data provided, thus creating additional features within the training and testing datasets. Techniques including but not limited to linear regression, logistic regression, random forests and extreme gradient boosted regression were all used to model the true probability of any particular even-strength event being a goal in the NHL. Then, using formulaic approaches, the individual event model was extrapolated to draw larger conclusions. Lastly, some unique data visualization techniques were used to best present the outputs of these models. In all, many experimental models were created, yielding a reproducible methodology with which to evaluate any NHL player’s impact on goal scoring over the course of a season.
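One common way to move from an event-level model to a player-level conclusion is to sum each player's per-shot goal probabilities into a season total. The sketch below assumes that aggregation; it is not taken from the thesis, and the function name, player labels, and probabilities are hypothetical.

```python
from collections import defaultdict


def expected_goals_by_player(shots):
    """Sum per-shot goal probabilities for each player.

    `shots` is an iterable of (player, goal_probability) pairs, where the
    probability would come from a fitted event-level model.
    """
    totals = defaultdict(float)
    for player, goal_probability in shots:
        totals[player] += goal_probability
    return dict(totals)


# Hypothetical model outputs for three even-strength shots.
season_shots = [("Player A", 0.10), ("Player B", 0.30), ("Player A", 0.05)]
print(expected_goals_by_player(season_shots))
```

Comparing such a total with the goals a player actually scored is one way that true results can be used to evaluate the model outputs without the model being built to reproduce them.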
BUILDING A STATISTICAL LEARNING MODEL FOR EVALUATION OF NBA PLAYERS USING PLAYER TRACKING DATA
Matthew Byman, M.S. Data Science
This thesis aims to develop faster and more accurate methods for evaluating NBA player performances by leveraging publicly available player tracking data. The primary research question addresses whether player tracking data can improve existing performance evaluation metrics. The ultimate goal is to enable teams to make better-informed decisions in player acquisitions and evaluations.
To achieve this objective, the study first acquired player tracking data for all available NBA seasons from 2013 to 2021. Regularized Adjusted Plus-Minus (RAPM) was selected as the target variable, as it effectively ranks player value over the long term. Five statistical learning models were employed to estimate RAPM using player tracking data as features. Furthermore, the coefficients of each feature were ranked, and the models were rerun with only the 30 most important features.
Once the models were developed, they were tested on newly acquired player tracking data from the 2022 season to evaluate their effectiveness in estimating RAPM. The key findings revealed that Lasso Regression and Random Forest models performed the best in predicting RAPM values. These models enable the use of player tracking statistics that settle earlier, providing an accurate estimate of future RAPM. This early insight into player performance offers teams a competitive advantage in player evaluations and acquisitions.
In conclusion, this study demonstrates that combining statistical learning models with player tracking data can effectively estimate performance metrics, such as RAPM, earlier in the season. By obtaining accurate RAPM estimates before other teams, organizations can identify and acquire top-performing players, ultimately enhancing their competitive edge in the NBA.
BUILDING AN ML DRIVEN SYSTEM FOR REAL-TIME CODE-PERFORMANCE MONITORING
Mikhail Delyusto, M.S. Data Science
This project is part of a multidirectional effort to increase the quality of the software and data products produced by the Science and Engineering departments of Aetion Inc. Aetion is transforming the healthcare industry by providing its partners (major healthcare industry players) with a real-world evidence generation platform that helps to drive greater safety, effectiveness, and value of health treatments. Large datasets (up to 100Tb each) of healthcare market data (for example, insurance claims) are ingested into the platform and transformed into Aetion’s proprietary longitudinal format.
This attempt is being led by the Quality Engineering Team and is envisioned to move away from conventional testing techniques by decoupling different moving parts and isolating them in separate, maintainable and reliable tools.
A subject of this thesis is a particular branch of a large quality initiative that will be helping to continuously monitor a number of metrics that are involved in execution of the two most common types of jobs that run on Aetion’s platform: cohorts and analyses. These jobs may take up to a few hours to generate depending on the size of a dataset and the complexity of an analysis.
Once implemented, this monitoring system would be supplied with a feed of logs that contain certain data points, such as timestamps. A built-in algorithm that sets a threshold on the metrics and notifies users (stakeholders from Engineering and Science) when that threshold is exceeded would be a game-changing capability in Aetion’s quality space. Currently, there is no way to tell whether a given job is taking significantly more or less time than expected, and most defects are identified in upper environments (including production).
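The thresholding idea can be sketched as follows. The rule used here, flagging a job whose duration falls more than `k` sample standard deviations from the historical mean, is an assumption chosen for illustration; the abstract does not specify the algorithm, and the function names and durations are hypothetical.

```python
from statistics import mean, stdev


def duration_bounds(history_seconds, k=3.0):
    """Return (low, high) bounds as mean -/+ k sample standard deviations."""
    mu = mean(history_seconds)
    sigma = stdev(history_seconds)  # requires at least two past runs
    return mu - k * sigma, mu + k * sigma


def check_job(duration_seconds, history_seconds, k=3.0):
    """Classify a job duration as 'slow', 'fast', or 'ok' against history."""
    low, high = duration_bounds(history_seconds, k)
    if duration_seconds > high:
        return "slow"
    if duration_seconds < low:
        return "fast"
    return "ok"


# Hypothetical durations (seconds) of earlier runs of the same cohort job,
# derived from start and end timestamps in the logs.
history = [100, 102, 98, 101, 99]
print(check_job(120, history))
```

A two-sided check matters here because a job that finishes much faster than usual can signal a defect just as a slow one can.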
Issues identified in upper environments are the costliest of all types and, by various industry estimates, can cost $5,000 to $10,000 each.
As a result of implementing this system, we would expect a steep decrease in the number of issues in upper environments, as well as an increase in release frequency, from which the organization will greatly benefit.
OPTIMIZING PRODUCT RECOMMENDATION DECISIONS USING SPATIAL ANALYSIS
Raul A. Hincapie, M.S. Data Science
At a certain Consumer Packaged Goods (CPG) company, there was a need to coordinate between sales, geographic location, and demographic datasets to make better-informed business decisions. One area that required this type of coordination was the replacement process of a specific product being sold to a store. The need for this type of replacement arises when a product is not authorized to be sold at the store, out of stock, permanently discontinued, or not selling at the intended rate. Previously, the process at this company relied on instinctual decision-making when it came to product replacements, which showed a need for this protocol to be more data-driven.
The premise of this project is to create a data-driven product replacement process. It would be a type of system where the CPG company inputs a store and a product then it would output a product list with suitable replacement items. The replacement items would be based on stores similar to the input store using its sales, geographic location, and demographic portfolio. By identifying these similar stores, it is possible that the CPG company could also discover product opportunities or niches for a specific store or region. With a system like this, the company will increase their regional product knowledge based on geographical location as well as improve current and future sales. The system could also provide highly valuable information on its consumer preferences and behaviors, which could eventually help to understand future customers.
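A minimal sketch of the store-in, product-list-out idea follows. It assumes each store is described by a numeric feature vector (already scaled) and that similarity is Euclidean distance; the thesis may use a different similarity measure, and every name and number below is hypothetical.

```python
from math import dist


def similar_stores(target, features, n=2):
    """Return the n store ids closest to `target` in feature space."""
    others = [store for store in features if store != target]
    return sorted(others, key=lambda s: dist(features[target], features[s]))[:n]


def replacement_candidates(target, product, features, sales, n=2):
    """Rank products sold at similar stores, excluding the product to replace."""
    totals = {}
    for store in similar_stores(target, features, n):
        for item, units in sales[store].items():
            if item != product:
                totals[item] = totals.get(item, 0) + units
    return sorted(totals, key=lambda item: totals[item], reverse=True)


# Hypothetical scaled features (e.g. sales index, demographic index) and unit sales.
features = {"A": (1.0, 1.0), "B": (1.1, 0.9), "C": (0.9, 1.2), "D": (5.0, 5.0)}
sales = {
    "A": {"cola": 10},
    "B": {"cola": 8, "tea": 12, "juice": 3},
    "C": {"tea": 5, "water": 9},
    "D": {"beer": 50},
}
print(replacement_candidates("A", "cola", features, sales))
```

Geographic location could enter either as additional features or as a filter on which stores are eligible neighbors; which works better would depend on the company's data.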
PREDICTING AND ANALYZING STOCK MARKET BEHAVIOR USING MAGAZINE COVERS
Egor Isakson, M.S. Data Science
Financial magazines have been part of the financial industry right from the start. There has long been a debate whether a stock being featured in a magazine is a contrarian signal. The reasoning behind this is simple: any informational edge reaches the wide masses last, which means that by the time that happens, the bulk of the directional move of the financial instrument has long been completed. This paper puts this idea to the test by examining the behavior of the stock market and the stocks that are featured on the covers of various financial magazines and newspapers. By going through several stages of data extraction and processing using a series of up-to-date data science techniques, ticker symbols are derived from raw color images of covers. The derivation results in a many-to-many relationship, where a single ticker can show up on covers at different points in time and a single cover can feature many tickers at once. From there, several historical price and media-related features are created in preparation for the machine learning models. Several models are used to examine the behavior of the stock and the index at different horizons into the future. The results are better than random but insufficient as the sole determinant of the direction of the asset.
IDENTIFYING OUTLIER DATA POINTS IN NON-CLINICAL INVESTIGATIONAL NEW DRUG SUBMISSIONS
Cassandra O’Malley, M.S. Data Science
The Food and Drug Administration (FDA) uses a format known as SEND (Standard for Exchange of Nonclinical Data) to evaluate non-clinical (animal) studies for investigational new drug applications. Investigative drug sponsors currently use information from historical and control data to determine if drugs cause toxicity.
The goal of this study is to identify outlying data points that may indicate an investigative new drug could be toxic. Examples include a negative body weight gain over time, enlarged organ weights, or laboratory test abnormalities, especially in relation to a control group within the same study. Flagged records can be analyzed by a veterinarian or pathologist for potential signs of toxicity without looking at each individual data point.
Common domains within the non-clinical pharmaceutical studies were evaluated using changes from baseline measurements, changes from the control group, a percent change from the previous measurement with reference to the ethical guidelines, values outside of the mean ± two standard deviations, and a measure of abnormal findings relative to unremarkable findings in pathology. A program was designed to analyze five of these domains and return a collection of possible outlying data, giving a study monitor a simpler and faster alternative to examining each data point individually. The resulting file is more easily read by someone unfamiliar with the SEND format.
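The mean ± two standard deviations rule can be sketched as below. Using the control group of the same study as the reference distribution is an assumption made for this example, and the function name, animal identifiers, and weights are hypothetical rather than drawn from a SEND dataset.

```python
from statistics import mean, stdev


def flag_outliers(control_values, treated, n_sd=2.0):
    """Flag treated-animal values outside control mean +/- n_sd sample SDs.

    `treated` maps an animal id to its measurement; the result is a list of
    (animal id, value) pairs in the order they appear in `treated`.
    """
    mu = mean(control_values)
    sd = stdev(control_values)
    low, high = mu - n_sd * sd, mu + n_sd * sd
    return [(animal, value) for animal, value in treated.items()
            if value < low or value > high]


# Hypothetical body weights in grams.
control = [250, 255, 245, 252, 248]
treated = {"A01": 251.0, "A02": 230.0, "A03": 259.0}
print(flag_outliers(control, treated))
```

Only the flagged records would then need review by a veterinarian or pathologist.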
With this program, analyzing a study for possible toxic effects during the study can save time, effort, and even animal lives by identifying the signs of toxicity early. Sponsors or CROs can determine if the product is safe enough to proceed with testing or should be stopped in the interest of safety and additional research.
CLIMATE CHANGE IMPACTS ON FOOD PRODUCTION: A BIBLIOMETRIC NETWORK ANALYSIS
Skylar Clawson, M.S. Data Science
Climate change is an environmental issue that is affecting many different sectors of society such as terrestrial, freshwater and marine ecosystems, human health and agriculture. With a growing population, food security is a serious issue exacerbated by climate change. Climate change is not only impacting food production, but food production is also impacting climate change by emitting greenhouse gasses during the different stages of the food supply chain. This project seeks to use a bibliometric network analysis to identify the influence that the food supply chain has on climate change. We created four networks for each stage in the food supply chain (food processing, food transportation, food retail, food waste) to distinguish how influential the food supply chain is on climate change. The data needed for a bibliometric network comes from a scientific database and the networks are created based on a co-word analysis. Co-word analysis reveals words that frequently appear together to show that they have some form of a relationship in research publications. The second part of our analysis is more focused on how climate change impacts the early growth and development stages of grains. We collected data on several grains as well as temperature and precipitation to see if the representing climate stressors had any influence on production rates. This project’s main focus is to identify how climate change and food production could be influencing each other. The main findings of this project indicate that all four stages of the food supply chain influence climate change. This project also indicates that climate change affects grain production by different climate variables such as temperature and precipitation variability.
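The co-word counting step can be illustrated with a short sketch: for each publication, every unordered pair of its keywords is counted once, and the pair counts become edge weights in the network. The function name and keyword lists are hypothetical, and a real analysis would start from records exported from a scientific database.

```python
from collections import Counter
from itertools import combinations


def coword_counts(keyword_lists):
    """Count how many publications each unordered keyword pair appears in."""
    counts = Counter()
    for keywords in keyword_lists:
        # Lowercase and deduplicate so each pair counts once per publication;
        # sorting makes each pair's key independent of keyword order.
        unique = sorted({keyword.lower() for keyword in keywords})
        counts.update(combinations(unique, 2))
    return counts


# Hypothetical keyword lists from three publications.
publications = [
    ["climate change", "food waste", "emissions"],
    ["Food Waste", "climate change"],
    ["emissions", "transport"],
]
for pair, weight in coword_counts(publications).most_common(3):
    print(pair, weight)
```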
EXPLORING VEHICLE SERVICE CONTRACT CANCELLATIONS
Josip Skunca, M.S. Data Science
The goal of this thesis is to propose the cancellation reserve requirement for ServiceContract.com, a start-up vehicle service contract administrator being formed by its parent company DOWC. DOWC is a vehicle service contract administrator who prides itself on offering customized financial products to large car dealerships. The creation of ServiceContract.com (referred to as ServiceContract) serves to offer no-chargeback products as a means of marketing to another portion of the automotive industry. No-chargeback means that if the contract cancels (after 90 days) the Dealership, Finance Manager, and Agent (account manager) of the account are not required to refund their profit from the insurance contract – the administrator refunds the prorated price of the contract. In other words, the administrator must refund the entirety of the contract’s price, prorated at the time of cancellation.
Therefore, the cancellation reserve is the price that must be collected per contract in order to cover all cancellation costs. This research was a requirement to determine the feasibility of the new company and determine the pricing requirements of its products. The pricing of the new company’s products would determine ServiceContract’s competitiveness in the market, and therefore provide an evaluation of the business model.
To find this reserve requirement, research first started by finding the total amount of money that DOWC has refunded, along with the total number of contracts sold. Adding specific information allowed the calculation of these requirements in the necessary form. Service contract administrators are required to file rate cards with each state that must clearly specify the dimensions of the contract and their corresponding price.
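The arithmetic behind this starting point can be sketched as follows. The sketch assumes a linear, time-based proration and a reserve equal to total refunds divided by contracts sold; actual contracts may prorate differently (for example by mileage), and all figures below are hypothetical.

```python
def prorated_refund(price, term_months, months_elapsed):
    """Refund owed on cancellation, prorated linearly by unused months."""
    remaining = max(term_months - months_elapsed, 0)
    return price * remaining / term_months


def reserve_per_contract(total_refunded, contracts_sold):
    """Average amount to collect per contract to cover all refunds."""
    if contracts_sold <= 0:
        raise ValueError("contracts_sold must be positive")
    return total_refunded / contracts_sold


# Hypothetical example: a 48-month contract cancelled after 12 months.
print(prorated_refund(2400, 48, 12))
# Hypothetical portfolio: total refunds spread over every contract sold.
print(reserve_per_contract(150_000, 1_000))
```

Breaking the same calculation down by the rate-card dimensions would give a reserve in the form the state filings require.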
The key result in the research was the realization that the Cancellation Reserve would be tied to the Maximum allowed retail price. If the maximum price dealerships can sell for is lowered, the required Cancellation Reserve will follow suit, and as a result lower the Coverage cost of the contract. This allowed for the dealership to have an opportunity to make their desired profit, while enabling ServiceContract to offer competitive pricing.
The most significant impact of these results is that ServiceContract was able to determine that the company had more competitive rates than both competitors and DOWC. This research opened the company’s eyes to the benefit of this kind of research, and will prompt further research in the future.
A TOOL FOR WHO WILL DROP OUT OF SCHOOL
Colette Joelle Barca, M.S. Data Science
A student’s high school experience often forms the foundation of his or her postsecondary career. As the competition in our nation’s job market continues to increase, many businesses stipulate applicants need a college degree. However, recent studies show approximately one-third of the United States’ college students never obtain a degree. Although colleges have developed methods for identifying and supporting their struggling students, early intervention could be a more effective approach for combating postsecondary dropout rates. This project seeks to use anomaly detection techniques to create a holistic early detection tool that indicates which high school students are most at risk to drop out of college. An individual’s high school experience is not confined to the academic components. As such, an effective model should incorporate both environmental and educational factors, including various descriptive data on the student’s home area, the school’s area, and the school’s overall structure and performance. This project combined this information with data on students throughout their secondary educational careers (i.e., from ninth through twelfth grade) in an attempt to develop a model that could detect during high school which students have a higher probability of dropping out of college. The clustering-based and classification-based anomaly detection algorithms detail the situational and numeric circumstances, respectively, that most frequently result in a student dropping out of college. High school administrators could implement these models at the culmination of each school year to identify which students are most at risk for dropping out in college. Then, administrators could provide additional support to those students during the following school year to decrease that risk. College administrators could also follow this same process to minimize dropout rates.
COMPREHENSIVE ANALYSIS OF THE FUTURE PRICE OF NBA TOP SHOT MOMENTS
Miguel A. Esteban Diaz, M.S. Data Science
NBA Top Shot moments are NFTs built on the FLOW blockchain and created by Dapper Labs in collaboration with the NBA. These NFTs, commonly referred to as “moments”, consist of in-game highlights of an NBA or WNBA player. Using the different variables of a moment, such as the type of play made by the player appearing in the moment (dunk, assist, block, etc.), the number of listings of that moment in the marketplace, whether the player appearing in the moment is a rookie, or the rarity tier of the moment (Common, Fandom, Rare or Legendary), this project aims to provide a statistical analysis that could reveal hidden correlations between the characteristics of a moment and its price, as well as a prediction of the price of moments using machine learning regression models, including linear regression, random forests and neural networks. As NFTs, and especially NBA Top Shot, are a relatively recent area of research, there is not yet extensive research in this area. This research intends to expand the up-to-date analysis and research performed on this topic and to serve as a foundation for any future research in this area. It also aims to provide helpful and practical information about the valuation of moments, the importance of the diverse characteristics of moments and their impact on pricing, and the possible future application of this information to other similar highlight-oriented sport NFTs like NFL AllDay or UFC Strike, which are designed similarly to NBA Top Shot.
PREVENTING THE LOSS OF SKILLFUL TEACHERS: TEACHER TURNOVER PREDICTION USING MACHINE LEARNING TECHNIQUES
Nirusha Srishan, M.S. Data Science
Teacher turnover rate is an increasing problem in the United States. Each year, teachers leave their current teaching position to either move to a different school or to leave the profession entirely. In an effort to understand why teachers are leaving their current teaching positions and to help identify ways to increase teacher retention rate, I am exploring possible reasons that influence teacher turnover and creating a model to predict if a teacher will leave the teaching profession. The ongoing turnover of teachers has a vast impact on school district employees, the state, the country, and the student population. Therefore, exploring the variables that contribute to teacher turnover can ultimately lead to decreasing the rate of turnover.
This project compares those in the educational field, including general education teachers, special education teachers and other educational staff, who have completed the 1999-2000 Schools and Staffing Survey (SASS) and Teacher Follow-up Survey (TFS) from the National Center for Education Statistics (NCES, n.d.). This data will be used to identify trends in teachers that have left the profession. Predictive modeling will include various machine learning techniques, including Logistic Regression, Support Vector Machines (SVM), Decision Tree and Random Forest, and K-Nearest Neighbors. By finding the reasons for teacher turnover, a school district can identify a way to maximize its teacher retention rate, fostering a supportive learning environment for students and creating a positive work environment for educators.
FORECASTING AVERAGE SPEED OF CALL CENTER RESPONSES
Emmanuel Torres, M.S. Data Science
Organizations use multifaceted modern call centers yet rely on antiquated forecasting technologies, leading to erroneous staffing during critical periods of unprecedented volume. Companies will experience financial hemorrhaging or provide an inadequate customer experience due to incorrect staffing when sporadic volume emerges. The forecasting models currently employed have known caveats, such as the inability to handle wait time without abandonment and the consideration of only a single call type when making a prediction.
This study aims to create a new forecasting model to predict the Average Speed of Answer (ASA) to obtain a more accurate prediction of the staffing requirements for a call center. The new model will anticipate historical volume of varying capacities to create the prediction. Both parametric and nonparametric methodologies will be used to forecast the ASA. An ARIMA (Autoregressive Integrated Moving Average) parametric model was used to create a baseline for the prediction. Machine learning techniques such as Recurrent Neural Networks (RNNs) were used since they can process sequential data by feeding previous outputs back in as inputs. Specifically, Long Short-Term Memory (LSTM) recurrent neural networks were used to create a forecasting model for the call center ASA.
With the LSTM neural network, both a univariate and a multivariate approach were used to forecast the ASA. The findings confirm that the univariate LSTM neural network produced the most accurate forecast, with the lowest Root Mean Squared Error (RMSE) of the three methods used to predict the call center ASA. The multivariate LSTM model also provided a reasonably accurate prediction but received a higher RMSE than the univariate model. ARIMA had the highest RMSE of the three and forecasted the ASA inaccurately.
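For reference, RMSE is the square root of the mean squared difference between observed and forecast values, so a lower value indicates a closer forecast. A minimal sketch follows; the ASA values and the two forecasts are hypothetical and are not results from the study.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length numeric sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ASA observations (seconds) and two competing forecasts.
observed = [30, 45, 60]
forecast_a = [32, 44, 57]
forecast_b = [40, 30, 80]
print(rmse(observed, forecast_a), rmse(observed, forecast_b))
```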
A COMPREHENSIVE EVALUATION ON THE APPLICATIONS OF DATA AUGMENTATION, TRANSFER LEARNING AND IMAGE ENHANCEMENT IN DEVELOPING A ROBUST SPEECH EMOTION RECOGNITION SYSTEM
Kyle Philip Calabro, M.S. Data Science
Within this thesis work, the applications of data augmentation, transfer learning, and image enhancement techniques were explored in great depth with respect to speech emotion recognition (SER) via convolutional neural networks and the classification of spectrogram images. Speech emotion recognition is a challenging subset of machine learning with an incredibly active research community. One of the prominent challenges of SER is a lack of quality training data. The methods developed and presented in this work serve to alleviate this issue and improve upon the current state-of-the-art methodology. A novel unimodal approach was taken in which five transfer learning models pre-trained on the ImageNet data set were used with both the feature extraction and fine-tuning methods of transfer learning. These transfer learning models are VGG-16, VGG-19, InceptionV3, Xception and ResNet-50. A modified version of the AlexNet deep neural network model was utilized as a baseline for non-pre-trained deep neural networks. Two speech corpora were utilized to develop these methods: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). Data augmentation techniques were applied to the raw audio of each speech corpus to increase the amount of training data, yielding custom data sets. Raw audio data augmentation techniques include the addition of Gaussian noise, stretching by two different factors, time shifting and shifting pitch by three separate tones. Image enhancement techniques were implemented with the aim of improving classification accuracy by unveiling more prominent features in the spectrograms. Image enhancement techniques include conversion to grayscale, contrast stretching and the combination of grayscale conversion followed by contrast stretching.
In all, 176 experiments were conducted to provide a comprehensive overview of all techniques that were proposed as well as a definitive methodology. Such methodology yields improved or comparable results to what is currently considered to be state-of-the-art when deployed on the RAVDESS and CREMA-D speech corpora.
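Of the image enhancement techniques named above, contrast stretching is the simplest to illustrate. The sketch below assumes a linear min-max stretch applied to a grayscale image stored as a list of pixel rows; the thesis may use a percentile-based or library-provided variant, and the function name and pixel values are hypothetical.

```python
def contrast_stretch(image, out_min=0, out_max=255):
    """Linearly rescale pixel intensities to span [out_min, out_max]."""
    flat = [pixel for row in image for pixel in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat image has no contrast to stretch; map everything to out_min.
        return [[out_min for _ in row] for row in image]
    scale = (out_max - out_min) / (hi - lo)
    return [[round(out_min + (pixel - lo) * scale) for pixel in row]
            for row in image]


# Hypothetical 2x2 grayscale spectrogram patch with a narrow intensity range.
patch = [[50, 100], [150, 200]]
print(contrast_stretch(patch))
```

Stretching a narrow intensity range across the full scale makes faint spectrogram features more distinct before the image is passed to a classifier.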