Spring 2025
Analyzing Artificial Intelligence’s Ability to Detect
Misinformation
Max Bilyk, M.S. Data Science
Misinformation and disinformation represent critical societal challenges of the 21st century, significantly amplified by rapid advancements in digital technology. The proliferation of generative artificial intelligence (AI) exacerbates these problems, enabling false narratives to spread at unprecedented speeds, undermining public trust, polarizing societies, and endangering democratic processes. Traditional methods, such as manual fact-checking, governmental initiatives, and educational programs, while effective, are increasingly insufficient in addressing the scale and immediacy of digital misinformation.
This thesis aims to critically evaluate artificial intelligence’s potential in addressing the misinformation crisis. Specifically, it investigates how AI-driven techniques, particularly natural language processing (NLP), can improve misinformation detection and fact-checking processes. Further, it examines ethical considerations surrounding AI use, evaluates practical and technical implementation challenges, and proposes solutions to improve these technologies.
A mixed-methods approach was employed, encompassing historical analysis of misinformation, review of existing solutions, examination of contemporary AI technologies, and detailed case studies evaluating AI’s application in real-world misinformation scenarios. Additionally, a quantitative performance analysis of an AI-driven misinformation classifier was conducted using a structured prompt engineering method. This involved scoring news articles on factuality, logic, sentiment, and bias, using a composite measure tested against a labeled dataset of verified true or false articles.
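A minimal sketch of the composite scoring step is shown below, assuming Python and scikit-learn; in practice the four sub-scores would come from the structured LLM prompts, while here they are hard-coded stand-ins, and the weights, threshold, and toy labels are illustrative assumptions rather than the thesis's calibrated values.

```python
# Minimal sketch of the composite measure (not the thesis's calibrated pipeline).
# The sub-scores would come from structured LLM prompts; here they are hard-coded
# stand-ins, and the weights and threshold are assumptions.
from typing import Dict
from sklearn.metrics import accuracy_score

WEIGHTS = {"factuality": 0.4, "logic": 0.3, "sentiment": 0.15, "bias": 0.15}

def composite_score(sub_scores: Dict[str, float]) -> float:
    """Weighted average of the factuality, logic, sentiment, and bias sub-scores (each 0-1)."""
    return sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS)

def classify(sub_scores: Dict[str, float], threshold: float = 0.5) -> int:
    """Label an article as true (1) or false (0) by thresholding the composite score."""
    return int(composite_score(sub_scores) >= threshold)

# Evaluation against a labeled set of verified articles (toy examples).
labeled = [
    ({"factuality": 0.9, "logic": 0.8, "sentiment": 0.7, "bias": 0.8}, 1),
    ({"factuality": 0.2, "logic": 0.4, "sentiment": 0.3, "bias": 0.2}, 0),
]
preds = [classify(scores) for scores, _ in labeled]
print("accuracy:", accuracy_score([label for _, label in labeled], preds))
```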
The thesis demonstrated that AI systems, particularly large language models (LLMs), show substantial promise in misinformation detection, achieving over 90% accuracy when optimally calibrated. Real-world case studies, including the UK-based organization Full Fact, revealed AI’s capacity to significantly enhance fact-checking efficiency and responsiveness. Nevertheless, the study identified critical limitations, including AI’s difficulties in nuanced contextual understanding, bias propagation, ethical dilemmas, and environmental sustainability concerns. The research highlights the necessity of continued human oversight—particularly through human-in-the-loop (HITL) models—to address AI’s current limitations.
The findings underscore that AI, while not flawless, holds promise as a scalable, effective tool against misinformation when complemented by rigorous ethical frameworks, transparency (via Explainable AI), multimodal approaches, human-in-the-loop systems and widespread AI literacy initiatives. The broader implications suggest that successful deployment of AI in misinformation detection necessitates interdisciplinary collaboration, proactive bias mitigation, robust public education, and sustained human involvement. Addressing misinformation through AI is not only a technological pursuit but fundamentally an ethical and societal responsibility crucial for maintaining the integrity of public discourse and democratic institutions in the digital age.
Full Text
Exploring Jane Addams Papers Project Documents Through Topic
Modeling and Multilabel Classification
Olivia Church, M.S. Applied Mathematics
The Jane Addams Papers Project at Ramapo College of New Jersey compiles documents relating to Jane Addams. An American activist and social worker, Addams was an influential member of many political and social movements throughout the nineteenth and twentieth centuries, advocating for women’s suffrage, child labor reform, and peace, among other matters. The Digital Edition of the Jane Addams Papers Project contains digital versions of the documents, as well as a variety of other features, such as tags that categorize the documents based on their content. To explore new ways of analyzing and organizing documents from the Digital Edition, two machine learning techniques were implemented: topic modeling and multilabel classification. In addition to extracting insights from the documents and developing an automated method of assigning tags, a central aim of this research was to investigate how topic modeling and multilabel classification can be bridged to enrich analyses of texts.
Using a subset of documents from the Digital Edition, speeches and articles written by Jane Addams, latent Dirichlet allocation (LDA) topic modeling identified central topics, or themes, including international affairs and conflicts, child labor, and women’s suffrage. A variety of multilabel classification models were utilized to predict tags. The problem transformation algorithm Binary Relevance used in conjunction with a Multinomial Naive Bayes classifier had the best performance, though a higher accuracy would have been more desirable. To link the topic modeling and multilabel classification results, each document and tag was assigned to a specific topic. A connection between the topics and predicted tags of documents was evident, with the multilabel classifier often predicting tags related to the topic of their corresponding document. Therefore, when used together, topic modeling and multilabel classification may complement each other, potentially contributing to a greater understanding of the subject matter of texts.
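The two techniques can be sketched with scikit-learn as follows; the documents, tags, and topic count are toy stand-ins for the Digital Edition data, and OneVsRestClassifier is used as scikit-learn's equivalent of the Binary Relevance transformation.

```python
# A minimal sketch of LDA topic modeling plus Binary Relevance multilabel classification,
# assuming scikit-learn; the corpus and tags below are toy stand-ins.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.multiclass import OneVsRestClassifier   # one binary classifier per tag = Binary Relevance
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["speech on women's suffrage ...", "article on child labor reform ...", "address on peace ..."]
tags = [["suffrage"], ["child labor"], ["peace", "international affairs"]]

X = CountVectorizer(stop_words="english").fit_transform(docs)

# Topic modeling: each row of doc_topics is a document's topic distribution.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)

# Multilabel classification: Binary Relevance with a Multinomial Naive Bayes base classifier.
Y = MultiLabelBinarizer().fit_transform(tags)
clf = OneVsRestClassifier(MultinomialNB()).fit(X, Y)
print(doc_topics.argmax(axis=1), clf.predict(X))
```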
Full Text
Building a Collaborative Recommender System for Magic the
Gathering
Brian DeNichillo, M.S. Data Science
The goal of this research is to address the challenge of deck construction in Magic the Gathering’s Commander format, a task requiring players to create a deck of 100 cards from a card pool of over 28,000 different cards while also adhering to the color identity constraints of the card chosen to be their commander. The objective is to develop a recommendation system, a tool that uses collaborative filtering to suggest relevant cards to the player based on the deck construction patterns of the Commander community.
The recommender system utilizes Alternating Least Squares (ALS) matrix factorization to identify latent features which capture the relationship between cards in a Commander deck. A model was trained using 220,000 player-created decks scraped from a popular deck building website. The model was tuned by systematically testing various configurations of hyperparameters which include latent factors, regularization values, confidence scaling, and iteration counts to determine the optimal configuration.
A final model was produced using 600 latent factors, regularization of 2.25, alpha of 10, and iteration count of 25. This parameter configuration resulted in an F1 score of 0.33 and MRR of 0.063. Additionally, it had a precision@5 of 0.64 and precision@10 of 0.58 when tested with a seed of 40%, meaning that 64% of the top 5 and 58% of the top 10 recommendations appeared in the test decks.
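A sketch of this setup is shown below, assuming the `implicit` library (whose fit/recommend API varies slightly across versions) and a toy deck-by-card matrix standing in for the 220,000 scraped decks.

```python
# A sketch of the ALS recommender, assuming the `implicit` library and toy data.
import numpy as np
import scipy.sparse as sp
from implicit.als import AlternatingLeastSquares

# Rows = decks, columns = cards; 1 means the card appears in that deck.
rng = np.random.default_rng(0)
deck_card = sp.csr_matrix(rng.integers(0, 2, size=(200, 500)).astype(np.float32))

# Thesis configuration: AlternatingLeastSquares(factors=600, regularization=2.25, iterations=25)
# with alpha = 10; a much smaller model keeps this toy sketch quick to run.
model = AlternatingLeastSquares(factors=32, regularization=2.25, iterations=25)
model.fit(deck_card * 10.0)                            # alpha = 10 applied as confidence scaling
ids, scores = model.recommend(0, deck_card[0], N=10)   # top-10 card suggestions for deck 0
print(ids)
```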
Full Text
Using Free-Text Clinical Notes to Improve Model Performance
in Healthcare
Daniel Figueiras, M.S. Data Science
Predictive models in healthcare often rely solely on structured data, missing crucial context contained in free-text clinical notes and thereby limiting accurate outcome prediction. This study quantified the impact of incorporating free-text discharge summaries alongside structured data to improve one-year mortality prediction by evaluating both resampling techniques and Natural Language Processing (NLP) methods.
Using the MIMIC-IV and MIMIC-IV-Note datasets, five machine learning model types were trained with structured data alone versus structured data combined with insights extracted from clinical notes using four NLP techniques (Bag of Words (BoW), Binary BoW, Term Frequency-Inverse Document Frequency (TF-IDF), and Sentiment Analysis). A hybrid resampling method addressed severe class imbalance. Performance was primarily evaluated using recall due to the nature of outcomes being predicted.
Baseline models, trained using only structured data, obtained poor recall scores (~0.17). Resampling was essential, boosting average recall by ~61.5%. Integrating clinical notes further improved performance. The gradient boosting model trained using TF-IDF features achieved the highest recall (0.779), a 4.6% gain over its baseline after resampling. TF-IDF and BoW were the most effective NLP methods overall. Key features from the best performing model included age and discharge location (from the structured data) and note terms (e.g., CT, disease).
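A compact sketch of the combined feature pipeline, assuming scikit-learn and imbalanced-learn; the single structured feature, the synthetic notes, and the choice of SMOTEENN as the hybrid resampler are illustrative assumptions, not the thesis's exact configuration.

```python
# Structured features + TF-IDF note features + hybrid resampling + gradient boosting.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTEENN   # one plausible hybrid over-/under-sampling choice

# Toy stand-in for MIMIC-IV: one structured feature (age) plus a discharge-note snippet.
rng = np.random.default_rng(0)
n = 400
age = rng.integers(40, 95, n)
died = (rng.random(n) < 0.15).astype(int)                       # imbalanced outcome
notes = np.where(died == 1, "CT shows metastatic disease", "routine follow up discharged home")

X = hstack([csr_matrix(age.reshape(-1, 1).astype(float)),
            TfidfVectorizer().fit_transform(notes)]).toarray()

X_tr, X_te, y_tr, y_te = train_test_split(X, died, test_size=0.3, stratify=died, random_state=0)
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X_tr, y_tr)  # hybrid resampling
clf = GradientBoostingClassifier(random_state=0).fit(X_res, y_res)
print("recall:", recall_score(y_te, clf.predict(X_te)))
```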
Overall, the inclusion of free-text clinical notes, combined with effective resampling, significantly enhances the performance of healthcare models, resulting in improved identification of high-risk patients and ultimately contributing to better patient care.
Full Text
Quant Model Tools
Kasmi Yussof, M.S. Data Science
Investing in financial instruments, such as stocks, has been a significant pillar of stability, savings, and wealth generation for decades due to the potential for positive returns. In response to such high investment activity, the stock market has risen exponentially over the last four decades, despite facing challenges such as severe market corrections, financial market crashes, and recessions. Over the years, the drive to maximize positive investment returns while minimizing negative ones has led to the emergence of algorithmic trading models. As trading models become more sophisticated, they require a substantial amount of analysis, including testing the model under simulated conditions, comparing trades, cleaning and sourcing price data, and optimizing profitable parameters.
This thesis presents a framework developed in the C++ programming language that enables the execution of multiple trading simulations using user-provided trading models, thereby generating meaningful performance insights for the user-provided model. This framework will allow users to configure their trading model through configuration files, backtest and optimize it across various simulated market conditions, and run WFA (Walk-Forward Analysis), which simulates a trading model against unseen market data as if it were trading live in the market. Furthermore, the trading model will be able to utilize a wide variety of stocks in its portfolio per the user’s request.
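The framework itself is written in C++; the short Python sketch below only illustrates the walk-forward analysis idea of optimizing parameters on an in-sample window and then evaluating them on the next unseen window, with a placeholder moving-average strategy standing in for a user-provided model.

```python
# Illustration of walk-forward analysis (WFA), not the thesis's C++ framework:
# optimize on an in-sample window, then evaluate the chosen parameters on the
# next unseen out-of-sample window, and roll forward.
def walk_forward(prices, in_sample=252, out_sample=63, param_grid=(5, 10, 20)):
    results = []
    start = 0
    while start + in_sample + out_sample <= len(prices):
        train = prices[start:start + in_sample]
        test = prices[start + in_sample:start + in_sample + out_sample]
        # "Optimization": pick the moving-average length that did best in-sample.
        best = max(param_grid, key=lambda w: backtest(train, w))
        results.append(backtest(test, best))      # evaluate on unseen data
        start += out_sample                        # roll the window forward
    return results

def backtest(prices, window):
    """Placeholder strategy return: buy when price is above its trailing mean."""
    pnl = 0.0
    for i in range(window, len(prices)):
        mean = sum(prices[i - window:i]) / window
        if prices[i - 1] > mean:
            pnl += prices[i] - prices[i - 1]
    return pnl

print(walk_forward([100 + 0.1 * i for i in range(800)]))
```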
The goal of this thesis is to develop tools to combat overfitting in trading models. As the number of investors, the amount of money, and the complexity of trading activity in the market increase, there is a significant need for a tool such as this framework for developing robust trading models.
Full Text
A Computational Analysis of the Thyroid Imaging and
Reporting Data System
Olivia Luisi, M.S. Computer Science
The detection of thyroid cancer is uniquely based upon a standardized system of numerical analysis. After a nodule is detected on a patient’s thyroid, the ultrasound images are analyzed to determine whether a biopsy is needed. Most developed countries follow the Thyroid Imaging Reporting and Data System created by the American College of Radiology, but each has its own rating scheme for the level of danger or suspicion surrounding the nodule, which makes some countries’ systems more or less likely to recommend a biopsy. While the United States uses a numerical value system, the European Union and South Korea use an algorithmic flow chart to determine the nodule’s rating, and the newer Chinese system focuses on dominant features of likely malignancy. Each has its own strengths and weaknesses, and comparing their rates of positive identification helps explain those differences. Thyroid cancer patients are rarely given such insight into the mechanisms used to assess their own health; this project seeks to let patients see the data behind what they are being told in their reports and compare their own cases against each system’s handling of cases like theirs.
Full Text
An Empirical Comparative Analysis of Skyline Query
Algorithms for Incomplete Data
Anthony Messana, M.S. Computer Science
Skyline queries are a popular and useful technique for multi-criteria analysis, but the presence of incomplete data complicates the retrieval of the skyline. Namely, incompleteness introduces the problems of intransitivity and cyclic dominance. Over the years, many algorithms have been developed to find skylines over incomplete data by addressing the two aforementioned problems. For software engineers working on Big Data applications or for researchers interested in the incomplete skyline problem, it can be useful to know in which contexts a particular class of algorithm may perform best. We sought to investigate the differing approaches to dealing with the unique challenges of computing the skyline over incomplete data. We picked three recently developed algorithms to represent certain classes of incomplete skyline algorithms and benchmarked them in different contexts. We controlled for the correlation of the dataset, the size of the dataset, and the dimensionality of the dataset. The three algorithms, PFSIDS (Liu et al.), TSI (He et al.), and BTIS (Yuan et al.), represent sorting-based, table-scan-based, and bucket-based approaches, respectively. We found that the sorting-based algorithm performed the best in general, except in the case of high-dimensional anti-correlated data. The table-scan-based algorithm was observed to work best in small, higher-dimensional datasets, and its performance did not change significantly with respect to the correlation of the data. The bucket-based approach generally performed the worst, which we believe to be due to the overhead of initializing the possibly large number of buckets for each data class.
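The dominance rule that all of these algorithms build on can be illustrated in a few lines; the sketch below (Python, smaller-is-better convention) is a naive quadratic baseline for intuition, not an implementation of PFSIDS, TSI, or BTIS.

```python
# Incomplete-data dominance: points are compared only on the dimensions where both
# values are present, which is exactly what breaks transitivity and allows cycles.
def dominates(p, q):
    """p dominates q if p is no worse on every shared dimension and better on at least one."""
    better = False
    for a, b in zip(p, q):
        if a is None or b is None:      # missing value: skip this dimension
            continue
        if a > b:                        # smaller is assumed better
            return False
        if a < b:
            better = True
    return better

def naive_incomplete_skyline(points):
    """Quadratic baseline: keep points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

pts = [(1, None, 3), (2, 2, None), (None, 1, 4), (3, 3, 3)]
print(naive_incomplete_skyline(pts))
```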
Full Text
The Impact Of Natural Disasters On Border Crossings In The
US
Harshitha Dalli Sai, M.S. Data Science
The U.S.-Mexico and U.S.-Canada borders are vital to the country’s trade and tourism sectors, which influence the economy, but the borders are susceptible to natural disasters. This research explores natural disasters as a factor in border crossing volumes by utilizing two datasets: U.S. border crossing entry data and FEMA disaster declarations data. The goal is to analyze patterns, use modeling to forecast border volumes, and predict volumes based on disasters, helping to assess the impact of disasters on crossings.
Exploratory Data Analysis (EDA), K-Means and K-Prototypes clustering, and SARIMA and ARIMA forecast models were applied to the border crossing data set to identify the important factors influencing the border volumes. The two data sets were joined by state, year, and month, for which statistical tests such as the Welch’s test were used to test the difference between volumes one month before and one month after a specific disaster type had occurred. A Generalized Linear model (GLM) with Poisson distribution, and a Negative Binomial model were fit for both the U.S.-Mexico and U.S.-Canada borders after checking for dispersion statistics for predicting border volumes based on disaster count and other border predictors.
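The three statistical tools named above can be sketched with statsmodels and SciPy as follows; the series, design matrix, and model orders are random placeholders rather than the BTS and FEMA data.

```python
# SARIMA forecasting, Welch's t-test, and a negative binomial GLM on toy data.
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.statespace.sarimax import SARIMAX
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
volumes = rng.poisson(1000, 120).astype(float)          # monthly crossing volumes (toy)

# Seasonal ARIMA forecast with a 12-month seasonal period.
sarima = SARIMAX(volumes, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
print(sarima.forecast(steps=6))

# Welch's t-test: volumes one month before vs. one month after a disaster type.
before, after = rng.poisson(1000, 30), rng.poisson(950, 30)
print(ttest_ind(before, after, equal_var=False))

# Negative binomial GLM: crossings as a function of disaster count and another predictor.
X = sm.add_constant(np.column_stack([rng.poisson(2, 120), rng.normal(size=120)]))
nb = sm.GLM(volumes, X, family=sm.families.NegativeBinomial()).fit()
print(nb.summary())
```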
The clustering model showed traffic patterns in which truck traffic crossings were higher at the southern ports and personal vehicle traffic was higher at the northern ports. The forecasting models captured seasonal trends, showing future volumes. Disaster periods show differences in traffic volumes one month before and after a disaster occurrence, though these differences are not statistically significant by Welch’s test. The negative binomial models suggested that disaster declarations were not strong predictors of crossing values, though their effect was slightly positive. Variables such as personal vehicle passengers and personal vehicles are strong predictors for both models. Pedestrian crossings are more impactful in predicting the number of crossings at the U.S.-Mexico border, while rail containers were more influential in predicting the number of crossings at the U.S.-Canada border.
These findings call for additional border-specific, state-specific, or disaster-specific analysis to dig deeper and help build better disaster response policies. Because the data used are counts aggregated by month or year, regression models such as Generalized Linear Models were used in this research. In the future, additional count models could be explored and compared to the Generalized Linear Models to find the best-fitting predictive model.
Full Text
Fall 2024
Game Genie: The Ultimate Video Game Recommendation System
Nicholas D’Amato, M.S. Computer Science
The video game market is varied and highly competitive, making it hard for gamers to decide which game to invest their time and money into. Many of the leading video game companies, such as Game Freak, Blizzard, EA, Bethesda, and Activision, built highly valued reputations with good games in the past; however, in the present they are releasing unpolished or unfinished games to earn more money, which people still buy because of that reputation. With an expert system that suggests games based on user ratings and personalized recommendations rather than popularity, gamers will have a better experience finding a game they prefer. Currently, there are a few such video game recommendation systems, either on the video game console or online, but all of them have flaws. These expert systems attempt to recommend games to people and include many different algorithms to predict and suggest games that match a specific user’s preferences. However, these current recommendation systems either lack personalized recommendations or lack game data, thus either giving the same recommendations to all users or giving the user a console-specific game. To solve this problem, an expert system with personalized recommendations is needed. The system incorporates a database of a variety of relevant games containing real-world data, integrated into a web application, which was then checked to make sure the program worked properly and that errors and bugs were eliminated. After completing and testing the proposed recommendation system, the final step was to observe the results and compare the system with other video game recommendation systems.
Full Text
Demonstration of application
GenEthic Analysis: Building a Secure and Accessible Genetic Analysis Framework
Sapir Sharoni, M.S. Data Science
The latest advances in genetic research have paved the way for innovative new applications of genetic data in areas such as ancestry research, forensic science, and medicine. However, the current Direct-To-Consumer (DTC) genetic platforms often have limited accessibility and utility, posing significant challenges for researchers and other professionals. Furthermore, concerns about the privacy and security of data within popular DTC companies persist among users. To address these limitations, a framework was developed for a predictive genetic analysis tool that prioritizes privacy, security, and user-friendliness. This study focused on predicting observable traits, including ancestry, biological sex, blood type, and eye and hair color, using single nucleotide polymorphisms (SNPs). A machine-learning-driven methodology was employed, integrating data preprocessing, standardized genotype encoding, and model evaluation. Models such as Gradient Boosting and Neural Networks were used to predict traits, demonstrating high accuracy across categories, including blood type and population groups (96%). The results demonstrate that, using the proposed framework, it is feasible to create a genetic analysis tool capable of bridging the gap between privacy and security and practical usability. It is important to note that the framework presented is adaptable, enabling its application across various industries. While this study focused on observable traits, future research can extend to various domains.
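A minimal sketch of the encode-then-classify step, assuming scikit-learn; the genotype table, blood-type labels, and encoding helper are synthetic illustrations rather than the framework's actual pipeline.

```python
# Encode genotypes as alternate-allele counts (0, 1, or 2) and classify a trait.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def encode(genotype, ref):                      # e.g. ("AG", "A") -> 1
    """Standardized genotype encoding: count of alleles differing from the reference."""
    return sum(allele != ref for allele in genotype)

rng = np.random.default_rng(0)
n_people, n_snps = 300, 50
X = rng.integers(0, 3, size=(n_people, n_snps))          # already-encoded toy genotypes
y = rng.choice(["A", "B", "AB", "O"], size=n_people)      # toy blood-type labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
print(encode("AG", "A"))   # the encoding helper applied to a single genotype call
```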
Full Text
Summer 2024
Time Series Analysis on Produce Truck Load Shipments
Ernest Barzaga, M.S. Data Science
The dataset for this project is sourced from a major freight broker based in New Jersey, with an annual revenue of approximately $200 million. The primary objectives are to implement techniques for handling missing data across the client’s highest-volume lanes, to prepare the dataset for modeling and analysis, and to predict truck costs. Additionally, the project explores the impact of external macroeconomic factors on the trucking industry and their relationship to truck costs. In the modeling phase, analysis is focused on a single lane—Salinas, California, to the Bronx, New York—due to its high shipment volume for produce. Various machine learning models were evaluated on this lane, with ARIMA performing best when the year 2022 was excluded from the training set, resulting in a root mean squared error (RMSE) of $436. SARIMA performed best when 2021 was excluded, yielding an RMSE of $834. Based on this initial iteration of modeling, recommendations for future modeling techniques were made, including the use of a vector autoregression model. This suggestion arose from hypothesis tests (Engle-Granger) that indicated the collected macroeconomic factors may have predictive power regarding truck costs.
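The two modeling pieces mentioned above can be sketched with statsmodels; the cost series and macroeconomic factor below are random placeholders for the broker's lane data.

```python
# ARIMA forecasting of truck cost plus an Engle-Granger cointegration test.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(0)
truck_cost = np.cumsum(rng.normal(0, 50, 200)) + 3000       # weekly lane cost (toy)
diesel_index = truck_cost * 0.4 + rng.normal(0, 100, 200)   # a macroeconomic factor (toy)

# ARIMA forecast; in the thesis, RMSE was computed on a holdout period.
fit = ARIMA(truck_cost[:-8], order=(1, 1, 1)).fit()
print(fit.forecast(steps=8))

# Engle-Granger cointegration test between cost and the macroeconomic factor.
t_stat, p_value, _ = coint(truck_cost, diesel_index)
print(p_value)
```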
Full Text
Spring 2024
Time Series Analysis on Differing Climate Regions
Jacob Insley, M.S. Data Science
Droughts, hurricanes, tornadoes, and other climate disasters wreak havoc in all corners of the world. Constantly, scientists and mathematicians are working on ways to predict such events and learn more about them. Unfortunately, the weather remains incredibly difficult to predict. If we can learn more about how data science and time series methods work on a variety of climate regions, we can understand how to put them to better use.
Two locations with very different seasonal patterns were looked at: Bergen County, NJ and Napa County, CA. Droughts were classified in both regions and different drought indices were compared in their ability to identify droughts. Time series techniques were used to predict the amount of precipitation in each location. A least squares regression model with a seasonal component and a SARIMA model were created to predict precipitation. We were able to discover some of the strengths and weaknesses of these tools when used on data from different climate regions.
The Standardized Precipitation Index was able to identify short-term drought well but failed to identify droughts during Napa County summers. The Palmer Drought Severity Index identified droughts all year long in both locations, but only when they occurred over several months. The SARIMA model decisively portrayed the seasonal pattern of Napa’s precipitation, but made more accurate predictions for the more stable climate of Bergen County.
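A simplified version of the Standardized Precipitation Index calculation, assuming SciPy; real SPI implementations also handle zero-precipitation months and fixed calibration periods, which this sketch omits.

```python
# Simplified SPI: fit a gamma distribution to precipitation totals and map the CDF
# onto a standard normal; negative values indicate drier-than-normal conditions.
import numpy as np
from scipy import stats

def spi(precip):
    precip = np.asarray(precip, dtype=float)
    shape, loc, scale = stats.gamma.fit(precip, floc=0)       # gamma fit with location fixed at 0
    cdf = stats.gamma.cdf(precip, shape, loc=loc, scale=scale)
    return stats.norm.ppf(cdf)                                # standardized index values

monthly_precip = np.random.default_rng(0).gamma(2.0, 2.0, 120)   # toy 10-year record
print(spi(monthly_precip)[:12])
```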
Scientists can use this data to better equip these tools to handle situations which they do not excel at. We can use what we learned about how drought is measured in both locations to find ways to improve the indices we measure with.
Full Text
PREDICTING FIRST YEAR RETENTION FOR UNDERGRADUATE EDUCATIONAL OPPORTUNITY FUND STUDENTS
Kelly O’Neill, M.S. Applied Mathematics
Predicting undergraduate retention using various machine learning algorithms has the potential to reduce the likelihood of attrition for students who are identified as being at an elevated risk of dropping out, thus providing a mechanism to help increase the likelihood of a student graduating from college. Following the approach of previous studies, retention is predicted using primarily freshman data, where retention is defined as a student being enrolled a year later from their first semester. For this thesis, the population was focused on predicting retention for Educational Opportunity Fund (EOF) students. Based on the EOF department’s most recent report, which comes from Ramapo’s Office of Institutional Research (2023), in 2016 the EOF 4-year graduation rate was 46.40% and the 6-year graduation rate was 63.10%, whereas for the college, using the Fall 2018 cohort, the four-year graduation rate was 56.9% and the six-year graduation rate was 69.5%. Identifying the specific individuals who will not be retained allows the EOF department to devise an appropriate plan and provide resources to help the students achieve academic success, and thus increase graduation rates.
This thesis considers many factors, provided by the EOF department, from Fall 2013 to Spring 2023, and I consider the impact of COVID within my analysis. I predict retention using logistic regression, decision tree, random forest, support vector machine, ensemble, and gradient boosting classifiers, where feature selection and the Synthetic Minority Over-sampling Technique (SMOTE), applied because the dataset was not balanced, were used for each algorithm. While all of the models performed well, even after 10-fold cross-validation, the random forest model using feature selection and a balanced dataset is recommended. In the future, the EOF department can use this model to determine which incoming students are at elevated risk of dropping out and provide them with the necessary resources to help them succeed.
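One way to combine feature selection, SMOTE, and a random forest under 10-fold cross-validation is sketched below with scikit-learn and imbalanced-learn; the features are synthetic, and the imblearn Pipeline is used so that oversampling happens only inside each training fold.

```python
# Feature selection + SMOTE + random forest evaluated with 10-fold cross-validation.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                         # e.g. GPA, credits, demographics (toy)
y = (rng.random(500) < 0.25).astype(int)               # imbalanced retention flag (toy)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),          # feature selection
    ("smote", SMOTE(random_state=0)),                  # balance only the training folds
    ("rf", RandomForestClassifier(random_state=0)),
])
print(cross_val_score(pipe, X, y, cv=10, scoring="f1").mean())
```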
The second part of this thesis is a comprehensive exploratory data analysis to learn more about the EOF student population. EOF students tend to struggle within the subject areas of math, biology, interdisciplinary studies, psychology, and chemistry. More specifically, in the courses math 108, interdisciplinary study 101, biology 221, critical reading and writing 102, amer/intl interdisciplinary 201, math 101, and math 110. Regarding retention, the average cumulative GPA for students who retained was 2.84, and 2.15 for students who did not retain. Furthermore, the average term GPA for those who retained was 2.67 but was 1.65 for students who did not retain.
Analyzing the relationship between retention and other variables, such as GPA, subject areas, and courses, provides the EOF department with a better idea of possible support mechanisms for students. Coupling this information with the recommended prediction algorithm can help the EOF department increase its four-year and six-year graduation rates by providing students with resources, guidance, and plans informed by the department’s expertise.
Full Text
Fall 2023
PROTOTYPING A LITERARY ANALYSIS TOOL FOR CREATIVES THAT DOESN’T USE CLOUD COMPUTING AND DEVELOPS NEW ANALYSIS METRICS
John Chmielowiec, M.S. Computer Science
The creative process for composing large literary works takes a lot of time. The writing itself is the largest time sink, but the subsequent editing and revision passes are also lengthy, primarily because humans are limited in how quickly they can read a work and then communicate their feedback to the author.
Natural language processing (NLP) is a computer science and data science topic that has garnered a lot of public interest over 2023 due to the release of tools such as ChatGPT. One of the features of NLP is the ability to perform analysis on literature. Some of the current capabilities include sentiment analysis (is a work positive, neutral, or negative), named entity recognition (what people, places, organizations, etc. are mentioned), and detecting whether something is sarcastic or not. These tools can be used to provide more immediate feedback to a writer and may even be able to notice details that experienced editors might miss. However, these tools are often only available online and require the submission of data that a user may want to keep confidential.
The aim of this thesis is to show that there can be a viable application that runs locally on a user’s computer and performs some natural language processing tasks. This program needs to be simple to use, perform effective analysis, and enable the end user to explore the data in meaningful ways. The target audience for this product would include writers, editors, publishers, and data scientists.
To this end, existing software libraries were examined for functionality, accuracy, and speed. Suitable components were selected and incorporated into a single tool. Existing concepts such as version control for software development and known NLP functionality were combined in new ways. That proof-of-concept tool was then used to produce sample results to show the viability of the product.
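A sketch of the kind of local, offline analysis involved, using off-the-shelf libraries (spaCy for named entity recognition and NLTK's VADER for sentiment); these are illustrative component choices, not necessarily the ones selected in the thesis.

```python
# Local NLP analysis that runs entirely on the user's machine once the small models
# are downloaded; no text is sent to a cloud service.
import spacy                                              # pip install spacy; python -m spacy download en_core_web_sm
from nltk.sentiment import SentimentIntensityAnalyzer     # requires nltk.download("vader_lexicon")

nlp = spacy.load("en_core_web_sm")
sia = SentimentIntensityAnalyzer()

text = "Elizabeth walked through the rainy streets of London, furious at Mr. Darcy."
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])       # named entity recognition
print(sia.polarity_scores(text))                          # positive / neutral / negative scores
```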
Full Text
COMBINING STATISTICAL ANALYSIS AND MACHINE LEARNING TO EXPLORE THE INTERPLAY BETWEEN AGING, LIFESTYLE CHOICES, CARDIOVASCULAR DISEASES, AND BRAIN STROKES
Anit Mathew, M.S. Data Science
This research project aims to investigate the intricate relationship between aging, lifestyle choices, cardiovascular diseases (CVDs), and brain strokes in older adults. The pressing problem is the growing burden of CVDs and strokes among the elderly, and the need to understand the impact of lifestyle factors on these health outcomes.
The primary objectives of this study are to assess the incidence of heart disease, high blood pressure, and brain strokes in the senior population, analyze how lifestyle factors like smoking status and body mass index (BMI) influence stroke frequency, explore the connections between heart disease, hypertension, and strokes, and investigate the potential influence of additional variables such as gender, average blood glucose levels, and type of residence. Furthermore, this research seeks to propose interventions and preventive strategies to reduce the incidence of brain strokes among older adults.
This research employs a comprehensive analysis of a publicly accessible dataset from Kaggle, which contains a wide range of health-related variables. The dataset provides valuable insights into lifestyle choices, health conditions, and the occurrence of brain strokes in older individuals. Various statistical and data analysis techniques will be applied to uncover associations and trends, contributing to a deeper understanding of the complex interactions between lifestyle choices, CVDs, and brain strokes.
Through a meticulous examination of the data, this study intends to shed light on the multifaceted relationships among lifestyle choices, cardiovascular diseases, and strokes in the elderly. The results will contribute to public health, geriatrics, and medical fields by providing evidence-based knowledge that can inform strategies for risk assessment, disease management, and health promotion among older adults.
This research project holds the potential to benefit multiple stakeholders. For healthcare professionals, the findings can lead to the development of effective strategies for the management of CVDs and strokes in older individuals. It may also inform public health campaigns and policy initiatives aimed at reducing the risk of these conditions within an aging population. Additionally, the study contributes to the existing body of knowledge in this field, providing a foundation for further research and the potential discovery of new interventions and risk mitigation strategies. Overall, this research addresses a critical health concern affecting older adults and has the potential to improve the well-being of this vulnerable population.
Full Text
EXAMINING DISEASE THROUGH MICROBIOME DATA ANALYSIS
Brett Van Tassel, M.S. Data Science
The objective of this project is to examine the relationship between the gut microbiomes of human subjects having different disease statuses by examining microbial diversity shifts. Read analysis and data cleaning are recorded from beginning to end so that the unfiltered and unfettered data can be reanalyzed and processed. Here we strive to create a tool that works for well-curated data. Data is gathered from the database QIITA, and the read data and metadata are queried via the tool redbiom. The initial exploratory analysis involved an examination of metadata attributes. A heat map of correlations among metadata attributes, computed using Cramér’s V, allows visual examination of associations. Next, we train random forests based on metadata of interest. Due to the large quantity of attributes, many random forests are trained, and their respective significance values and Receiver Operating Characteristic (ROC) curves are generated. ROC curves are used to isolate optimal correlations. This process is built into a pipeline, ultimately allowing the efficient, automated analysis and assignment of disease susceptibility. Alpha and beta diversity metrics are generated and plotted for visual interpretation using QIIME2, a microbial analysis software platform. CLOUD, a tool for finding microbiome outliers, is used to identify markers of dysbiosis and contamination, and to measure rates of successful identification. CLOUD was found to identify positive diagnoses where random forests did not when examining positive samples and their predicted diagnosis status. SMOTE was found to perform similarly to or slightly worse than random sampling as a data balancing technique.
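The Cramér's V computation behind the correlation heat map can be sketched as follows, assuming pandas and SciPy; the metadata frame is a toy stand-in for the QIITA metadata.

```python
# Cramér's V for a pair of categorical metadata attributes; computing it for every
# attribute pair yields the values plotted in the correlation heat map.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

meta = pd.DataFrame({"disease_status": ["IBD", "healthy", "IBD", "healthy", "IBD"],
                     "diet": ["omnivore", "vegetarian", "omnivore", "omnivore", "vegetarian"]})
print(cramers_v(meta["disease_status"], meta["diet"]))
```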
Full Text
Summer 2023
EVALUATING HOW NHL PLAYER SHOT SELECTION IMPACTS EVEN-STRENGTH GOAL OUTPUT OVER THE COURSE OF A FULL SEASON
Elliott Barinberg, M.S. Data Science
Within this thesis work, the applications of data collection, machine learning, and data visualization were used on National Hockey League (NHL) shot data collected between the 2014-2015 season and the 2022-2023 season. Modeling sports data to better understand player evaluation has always been a goal of sports analytics. In the modern era of sports analytics the techniques used to quantify impacts on games have multiplied. However, when it comes to ice hockey all the most difficult challenges of sports data analysis present themselves in trying to understand the player impacts of such a continuously changing game-state. The methods developed and presented in this work serve to highlight those challenges and better explain a player’s impact on goal scoring for their team.
Throughout this work there are multiple kinds of modeling techniques used to try to best demonstrate a player’s impact on goal scoring as a factor of all the elements the player is capable of controlling. We try to understand which players have the best offensive process and impact on goal-scoring by caring about the merit of the offensive opportunities they create. It is important to note that these models are not intended to re-create the results seen in reality, although reality and true results are used to evaluate the outputs.
This process used data scraping to collect the data from the NHL public application programming interface (API). Data cleansing techniques were applied to the collected data, yielding custom data sets which were used for the corresponding models. Data transformation techniques were used to calculate additional factors based upon the data provided, thus creating additional data within the training and testing datasets. Techniques including but not limited to linear regression, logistic regression, random forests and extreme gradient boosted regression were all used to attempt to model the true possibility of any particular even-strength event being a goal in the NHL. Then, using formulaic approaches the individual event model was extrapolated upon to draw larger conclusions. Lastly, some unique data visualization techniques were used to best present the outputs of these models. In all, many experimental models were created which have yielded a reproducible methodology upon which to evaluate the results of any NHL player impact upon goal scoring over the course of a season.
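A toy model in the spirit of the individual-event approach, assuming scikit-learn; shot distance and angle are stand-in features, and the simulated labels are not NHL data.

```python
# A logistic regression over per-shot features, predicting the probability of a goal.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
distance = rng.uniform(5, 60, n)                       # feet from the net (toy)
angle = rng.uniform(0, 90, n)                          # shooting angle in degrees (toy)
goal = (rng.random(n) < 1 / (1 + np.exp(0.08 * distance - 1.5))).astype(int)

X = np.column_stack([distance, angle])
model = LogisticRegression().fit(X, goal)

# Predicted goal probability of a single even-strength shot; summing these per player
# over a season gives an expected-goals-style figure to compare against actual goals.
print(model.predict_proba([[20.0, 45.0]])[0, 1])
```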
Full text
Spring 2023
BUILDING A STATISTICAL LEARNING MODEL FOR EVALUATION OF NBA PLAYERS USING PLAYER TRACKING DATA
Matthew Byman, M.S. Data Science
This thesis aims to develop faster and more accurate methods for evaluating NBA player performances by leveraging publicly available player tracking data. The primary research question addresses whether player tracking data can improve existing performance evaluation metrics. The ultimate goal is to enable teams to make better-informed decisions in player acquisitions and evaluations.
To achieve this objective, the study first acquired player tracking data for all available NBA seasons from 2013 to 2021. Regularized Adjusted Plus-Minus (RAPM) was selected as the target variable, as it effectively ranks player value over the long term. Five statistical learning models were employed to estimate RAPM using player tracking data as features. Furthermore, the coefficients of each feature were ranked, and the models were rerun with only the 30 most important features.
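One of the five models (Lasso regression) and the top-30 coefficient refit can be sketched as follows with scikit-learn; the tracking features and RAPM targets below are synthetic placeholders.

```python
# Lasso regression of RAPM on tracking features, then a refit on the 30 largest coefficients.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 120))                 # player-season tracking features (toy)
y = rng.normal(size=400)                        # RAPM target (toy)

lasso = Lasso(alpha=0.05).fit(X, y)
top30 = np.argsort(np.abs(lasso.coef_))[-30:]   # rank features by |coefficient|
refit = Lasso(alpha=0.05).fit(X[:, top30], y)   # rerun with only the 30 most important features
print(refit.score(X[:, top30], y))
```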
Once the models were developed, they were tested on newly acquired player tracking data from the 2022 season to evaluate their effectiveness in estimating RAPM. The key findings revealed that Lasso Regression and Random Forest models performed the best in predicting RAPM values. These models enable the use of player tracking statistics that settle earlier, providing an accurate estimate of future RAPM. This early insight into player performance offers teams a competitive advantage in player evaluations and acquisitions.
In conclusion, this study demonstrates that combining statistical learning models with player tracking data can effectively estimate performance metrics, such as RAPM, earlier in the season. By obtaining accurate RAPM estimates before other teams, organizations can identify and acquire top-performing players, ultimately enhancing their competitive edge in the NBA.
Full text
BUILDING AN ML DRIVEN SYSTEM FOR REAL-TIME CODE-PERFORMANCE MONITORING
Mikhail Delyusto, M.S. Data Science
This project is part of a multidirectional attempt to increase the quality of the software and data product produced by the Science and Engineering departments of Aetion Inc., a company that is transforming the healthcare industry by providing its partners (major healthcare industry players) with a real-world evidence generation platform that helps drive greater safety, effectiveness, and value of health treatments. Large datasets (up to 100 TB each) of healthcare market data (for example, insurance claims) are ingested into the platform and transformed into Aetion’s proprietary longitudinal format.
This attempt is being led by the Quality Engineering Team and is envisioned to move away from conventional testing techniques by decoupling different moving parts and isolating them in separate, maintainable and reliable tools.
The subject of this thesis is a particular branch of a larger quality initiative that will help continuously monitor a number of metrics involved in executing the two most common types of jobs that run on Aetion’s platform: cohorts and analyses. These jobs may take up to a few hours to generate, depending on the size of a dataset and the complexity of an analysis.
Once implemented, this monitoring system would be supplied with a feed of logs containing certain data points, such as timestamps. Enhanced with a built-in algorithm that sets a threshold on the metrics and notifies its users (stakeholders from Engineering and Science) when that threshold is exceeded, it would be a game-changing capability in Aetion’s quality space. Currently, there is no way to tell whether any given job is taking significantly more, or conversely significantly less, time than usual, and most defects get identified in upper environments (including production).
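One plausible form of such a thresholding rule, sketched in Python as a rolling mean plus k standard deviations over job runtimes; the window, multiplier, and alerting hook are assumptions, not Aetion's implementation.

```python
# Flag job runtimes that fall outside a rolling mean ± k standard deviations.
from collections import deque

def monitor(runtimes_seconds, window=30, k=3.0):
    history = deque(maxlen=window)
    for job_id, runtime in enumerate(runtimes_seconds):
        if len(history) == window:
            mean = sum(history) / window
            std = (sum((x - mean) ** 2 for x in history) / window) ** 0.5
            if abs(runtime - mean) > k * std:           # unusually slow *or* fast
                print(f"ALERT: job {job_id} ran {runtime:.0f}s (window mean {mean:.0f}s)")
        history.append(runtime)

monitor([3600, 3650, 3580] * 10 + [9000])    # toy cohort-generation runtimes
```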
The issues identified in upper environments are the costliest of all types and, by various industry estimates, can cost $5,000 – $10,000 each.
As a result of implementing said system, we would expect a steep decrease in the number of issues in upper environments, as well as an increase in release frequency, from which the organization would greatly benefit.
Full text
OPTIMIZING PRODUCT RECOMMENDATION DECISIONS USING SPATIAL ANALYSIS
Raul A. Hincapie, M.S. Data Science
At a certain Consumer Packaged Goods (CPG) company, there was a need to coordinate between sales, geographic location, and demographic datasets to make better-informed business decisions. One area that required this type of coordination was the replacement process of a specific product being sold to a store. The need for this type of replacement arises when a product is not authorized to be sold at the store, out of stock, permanently discontinued, or not selling at the intended rate. Previously, the process at this company relied on instinctual decision-making when it came to product replacements, which showed a need for this protocol to be more data-driven.
The premise of this project is to create a data-driven product replacement process: a system in which the CPG company inputs a store and a product, and the system outputs a list of suitable replacement items. The replacement items would be based on stores similar to the input store using its sales, geographic location, and demographic portfolio. By identifying these similar stores, it is possible that the CPG company could also discover product opportunities or niches for a specific store or region. With a system like this, the company will increase its regional product knowledge based on geographical location as well as improve current and future sales. The system could also provide highly valuable information on consumer preferences and behaviors, which could eventually help to understand future customers.
Full Text
PREDICTING AND ANALYZING STOCK MARKET BEHAVIOR USING MAGAZINE COVERS
Egor Isakson, M.S. Data Science
Financial magazines have been part of the financial industry from the start, and there has long been a debate about whether a stock being featured in a magazine is a contrarian signal. The reasoning behind this is simple: any informational edge reaches the wide masses last, which means that by the time it does, the bulk of the directional move of the financial instrument has long been completed. This paper puts this idea to the test by examining the behavior of the stock market and the stocks featured on the covers of various financial magazines and newspapers. Through several stages of data extraction and processing using up-to-date data science techniques, ticker symbols are derived from raw color images of covers. The derivation results in a many-to-many relationship, where a single ticker shows up at different points in time and a single cover may carry many tickers at once. From there, several historic price and media-related features are created in preparation for the machine learning models. Several models are used to examine the behavior of the stock and the index at different points in the future. Results are better than random but insufficient as the sole determinant of the asset’s direction.
Full Text
IDENTIFYING OUTLIER DATA POINTS IN NON-CLINICAL INVESTIGATIONAL NEW DRUG SUBMISSIONS
Cassandra O’Malley, M.S. Data Science
The Food and Drug Administration (FDA) uses a format known as SEND (Standard for Exchange of Nonclinical Data) to evaluate non-clinical (animal) studies for investigational new drug applications. Investigative drug sponsors currently use information from historical and control data to determine if drugs cause toxicity.
The goal of this study is to identify outlying data points that may indicate an investigative new drug could be toxic. Examples include a negative body weight gain over time, enlarged organ weights, or laboratory test abnormalities, especially in relation to a control group within the same study. Flagged records can be analyzed by a veterinarian or pathologist for potential signs of toxicity without looking at each individual data point.
Common domains within the non-clinical pharmaceutical studies were evaluated using changes from baseline measurements, changes from the control group, a percent change from the previous measurement with reference to the ethical guidelines, values outside of the mean ± two standard deviations, and a measure of abnormal findings to unremarkable findings in pathology. A program was designed to analyze five of these domains and return a collection of possible outlying data, making review simpler and faster than a study monitor’s inspection of individual data points and performing the analysis in a fraction of the time. The resulting file is also more easily read by someone unfamiliar with the SEND format.
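The mean ± two standard deviation screen on a single domain can be sketched with pandas; the body-weight column names below follow common SEND conventions but are meant as illustration only.

```python
# Flag body-weight records outside mean ± 2 standard deviations for expert review.
import pandas as pd

bw = pd.DataFrame({
    "USUBJID": [f"RAT-{i:02d}" for i in range(1, 11)],
    "BWSTRESN": [251.0, 248.5, 253.2, 250.9, 249.8, 252.4, 247.9, 251.6, 250.2, 199.7],  # grams
})

mean, sd = bw["BWSTRESN"].mean(), bw["BWSTRESN"].std()
flagged = bw[(bw["BWSTRESN"] < mean - 2 * sd) | (bw["BWSTRESN"] > mean + 2 * sd)]
print(flagged)      # records for a veterinarian or pathologist to review
```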
With this program, analyzing a study for possible toxic effects while it is still in progress can save time, effort, and even animal lives by identifying signs of toxicity early. Sponsors or CROs can determine whether the product is safe enough to proceed with testing or should be stopped in the interest of safety and additional research.
Full Text
Fall 2022
CLIMATE CHANGE IMPACTS ON FOOD PRODUCTION: A BIBLIOMETRIC NETWORK ANALYSIS
Skylar Clawson, M.S. Data Science
Climate change is an environmental issue that is affecting many different sectors of society, such as terrestrial, freshwater, and marine ecosystems, human health, and agriculture. With a growing population, food security is a serious issue exacerbated by climate change. Climate change is not only impacting food production, but food production is also impacting climate change by emitting greenhouse gases during the different stages of the food supply chain. This project seeks to use a bibliometric network analysis to identify the influence that the food supply chain has on climate change. We created four networks, one for each stage in the food supply chain (food processing, food transportation, food retail, food waste), to distinguish how influential the food supply chain is on climate change. The data needed for a bibliometric network comes from a scientific database, and the networks are created based on a co-word analysis. Co-word analysis reveals words that frequently appear together, showing that they have some form of relationship in research publications. The second part of our analysis is focused on how climate change impacts the early growth and development stages of grains. We collected data on several grains as well as temperature and precipitation to see if these climate stressors had any influence on production rates. This project’s main focus is to identify how climate change and food production could be influencing each other. The main findings of this project indicate that all four stages of the food supply chain influence climate change. This project also indicates that climate change affects grain production through climate variables such as temperature and precipitation variability.
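A minimal co-word counting sketch in Python: keyword pairs that appear together in a publication record become weighted edges of the bibliometric network. The records shown are toy stand-ins for database exports.

```python
# Count keyword co-occurrences across publication records to build co-word network edges.
from itertools import combinations
from collections import Counter

records = [["climate change", "food waste", "emissions"],
           ["food transportation", "climate change"],
           ["food waste", "emissions", "supply chain"]]

edges = Counter()
for keywords in records:
    for a, b in combinations(sorted(set(keywords)), 2):
        edges[(a, b)] += 1                     # co-occurrence weight

for (a, b), weight in edges.most_common(3):
    print(f"{a} -- {b}: {weight}")
```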
Full Text
EXPLORING VEHICLE SERVICE CONTRACT CANCELLATIONS
Josip Skunca, M.S. Data Science
The goal of this thesis is to propose the cancellation reserve requirement for ServiceContract.com, a start-up vehicle service contract administrator being formed by its parent company DOWC. DOWC is a vehicle service contract administrator who prides itself on offering customized financial products to large car dealerships. The creation of ServiceContract.com (referred to as ServiceContract) serves to offer no-chargeback products as a means of marketing to another portion of the automotive industry. No-chargeback means that if the contract cancels (after 90 days) the Dealership, Finance Manager, and Agent (account manager) of the account are not required to refund their profit from the insurance contract – the administrator refunds the prorated price of the contract. In other words, the administrator must refund the entirety of the contract’s price, prorated at the time of cancellation.
Therefore, the cancellation reserve is the price that must be collected per contract in order to cover all cancellation costs. This research was a requirement to determine the feasibility of the new company and determine the pricing requirements of its products. The pricing of the new company’s products would determine ServiceContract’s competitiveness in the market, and therefore provide an evaluation of the business model.
To find this reserve requirement, research first started by finding the total amount of money that DOWC has refunded, along with the total number of contracts sold. Adding specific information allowed the calculation of these requirements in the necessary form. Service contract administrators are required to file rate cards with each state that must clearly specify the dimensions of the contract and their corresponding price.
The key result in the research was the realization that the Cancellation Reserve would be tied to the Maximum allowed retail price. If the maximum price dealerships can sell for is lowered, the required Cancellation Reserve will follow suit, and as a result lower the Coverage cost of the contract. This allowed for the dealership to have an opportunity to make their desired profit, while enabling ServiceContract to offer competitive pricing.
The most significant impact of these results is that ServiceContract was able to determine that the company had more competitive rates than both competitors and DOWC. This research opened the company’s eyes to the benefit of this kind of research, and will prompt further research in the future.
Full Text
Spring 2022
A TOOL FOR WHO WILL DROP OUT OF SCHOOL
Colette Joelle Barca, M.S. Data Science
A student’s high school experience often forms the foundation of his or her postsecondary career. As the competition in our nation’s job market continues to increase, many businesses stipulate applicants need a college degree. However, recent studies show approximately one-third of the United States’ college students never obtain a degree. Although colleges have developed methods for identifying and supporting their struggling students, early intervention could be a more effective approach for combating postsecondary dropout rates. This project seeks to use anomaly detection techniques to create a holistic early detection tool that indicates which high school students are most at risk to drop out of college. An individual’s high school experience is not confined to the academic components. As such, an effective model should incorporate both environmental and educational factors, including various descriptive data on the student’s home area, the school’s area, and the school’s overall structure and performance. This project combined this information with data on students throughout their secondary educational careers (i.e., from ninth through twelfth grade) in an attempt to develop a model that could detect during high school which students have a higher probability of dropping out of college. The clustering-based and classification-based anomaly detection algorithms detail the situational and numeric circumstances, respectively, that most frequently result in a student dropping out of college. High school administrators could implement these models at the culmination of each school year to identify which students are most at risk for dropping out in college. Then, administrators could provide additional support to those students during the following school year to decrease that risk. College administrators could also follow this same process to minimize dropout rates.
Full Text
COMPREHENSIVE ANALYSIS OF THE FUTURE PRICE OF NBA TOP SHOT MOMENTS
Miguel A. Esteban Diaz, M.S. Data Science
NBA Top Shot moments are NFTs built on the FLOW blockchain and created by Dapper Labs in collaboration with the NBA. These NFTs, commonly referred to as “moments”, consist of in-game highlights of an NBA or WNBA player. Using the different variables of a moment, for example the type of play made by the featured player (dunk, assist, block, etc.), the number of listings of that moment in the marketplace, whether the player is a rookie, and the rarity tier of the moment (Common, Fandom, Rare, or Legendary), this project aims to provide a statistical analysis that could reveal hidden correlations between the characteristics of a moment and its price, along with a prediction of moment prices using machine learning regression models including linear regression, random forests, and neural networks. As NFTs, and especially NBA Top Shot, are a relatively recent area of research, there is not yet extensive research in this area. This research intends to expand the analysis performed on this topic to date and serve as a foundation for future research, as well as provide helpful and practical information about the valuation of moments, the importance of their diverse characteristics and the impact on pricing, and the possible future application of this information to other similar highlight-oriented sports NFTs like NFL AllDay or UFC Strike, which are designed similarly to NBA Top Shot.
Full Text
PREVENTING THE LOSS OF SKILLFUL TEACHERS: TEACHER TURNOVER PREDICTION USING MACHINE LEARNING TECHNIQUES
Nirusha Srishan, M.S. Data Science
Teacher turnover rate is an increasing problem in the United States. Each year, teachers leave their current teaching position to either move to a different school or to leave the profession entirely. In an effort to understand why teachers are leaving their current teaching positions and to help identify ways to increase teacher retention rate, I am exploring possible reasons that influence teacher turnover and creating a model to predict if a teacher will leave the teaching profession. The ongoing turnover of teachers has a vast impact on school district employees, the state, the country, and the student population. Therefore, exploring the variables that contribute to teacher turnover can ultimately lead to decreasing the rate of turnover.
This project compares those in the educational field, including general education teachers, special education teachers and other educational staff, who have completed the 1999-2000 School and Staffing Survey (SASS) and Teacher Follow-up Survey (TFS) from the National Center for Educational Statistics (NCES, n.d.). This data will be used to identify trends in teachers that have left the profession. Predictive modeling will include various machine learning techniques, including Logistic Regression, Support Vector Machines (SVM), Decision Tree and Random Forest, and K-Nearest Neighbors. By finding the reasons for teacher turnover, a school district can identify a way to maximize their teacher retention rate, fostering a supportive learning environment for students, and creating a positive work environment for educators.
Full Text
FORECASTING AVERAGE SPEED OF CALL CENTER RESPONSES
Emmanuel Torres, M.S. Data Science
Organizations run multifaceted modern call centers yet still rely on antiquated forecasting technologies, leading to erroneous staffing during critical periods of unprecedented volume. Companies experience financial hemorrhaging or provide an inadequate customer experience due to incorrect staffing when sporadic volume emerges. The forecasting models currently employed carry known caveats, such as the inability to handle wait time without abandonment and the consideration of only a single call type when making the prediction.
This study aims to create a new forecasting model to predict the Average Speed of Answer (ASA) to obtain a more accurate prediction of the staffing requirements for a call center. The new model will anticipate historical volume of varying capacities to create the prediction. Both parametric and nonparametric methodologies will be used to forecast the ASA. An ARIMA (Autoregressive Integrated Moving Average) parametric model was used to create a baseline for the prediction. The application of machine learning techniques such as Recurrent Neural Networks (RNN) was used since it can process sequential data by utilizing previous outputs as inputs to create the neural network. Specifically, Long Short-Term Memory (LSTM) recurrent neural networks were used to create a forecasting model for the call center ASA.
With the LSTM neural network, both univariate and multivariate approaches were utilized to forecast the ASA. The findings confirm that the univariate LSTM neural network resulted in the most accurate forecast, netting the lowest Root Mean Squared Error (RMSE) of the three methods used to predict the call center ASA. Although the univariate LSTM model produced the best results, the multivariate LSTM model was not far from providing an accurate prediction, though it received a higher RMSE than the univariate model. Furthermore, ARIMA produced the highest RMSE and forecasted the ASA inaccurately.
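A sketch of the univariate LSTM setup, assuming TensorFlow/Keras; the ASA series, window length, and network size are toy placeholders rather than the tuned model.

```python
# Univariate LSTM forecasting of an average-speed-of-answer (ASA) series.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

series = np.sin(np.linspace(0, 50, 500)) * 30 + 60      # toy ASA series (seconds)
window = 24

X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X.reshape((X.shape[0], window, 1))                  # (samples, timesteps, features)

model = Sequential([LSTM(32, input_shape=(window, 1)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[-1:], verbose=0))                 # next-step ASA forecast
```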
Full Text
Fall 2021
A COMPREHENSIVE EVALUATION ON THE APPLICATIONS OF DATA AUGMENTATION, TRANSFER LEARNING AND IMAGE ENHANCEMENT IN DEVELOPING A ROBUST SPEECH EMOTION RECOGNITION SYSTEM
Kyle Philip Calabro, M.S. Data Science
Within this thesis work, the applications of data augmentation, transfer learning, and image enhancement techniques were explored in great depth with respect to speech emotion recognition (SER) via convolutional neural networks and the classification of spectrogram images. Speech emotion recognition is a challenging subset of machine learning with an incredibly active research community. One of the prominent challenges of SER is a lack of quality training data. The methods developed and presented in this work serve to alleviate this issue and improve upon the current state-of-the-art methodology. A novel unimodal approach was taken in which five transfer learning models pre-trained on the ImageNet data set were used with both the feature extraction and fine-tuning methods of transfer learning. These transfer learning models include VGG-16, VGG-19, InceptionV3, Xception, and ResNet-50. A modified version of the AlexNet deep neural network model was utilized as a baseline for deep neural networks that were not pre-trained. Two speech corpora were utilized to develop these methods: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). Data augmentation techniques were applied to the raw audio of each speech corpus to increase the amount of training data, yielding custom data sets. Raw audio data augmentation techniques include the addition of Gaussian noise, stretching by two different factors, time shifting, and shifting pitch by three separate tones. Image enhancement techniques were implemented with the aim of improving classification accuracy by unveiling more prominent features in the spectrograms. Image enhancement techniques include conversion to grayscale, contrast stretching, and the combination of grayscale conversion followed by contrast stretching. In all, 176 experiments were conducted to provide a comprehensive overview of all techniques that were proposed as well as a definitive methodology. This methodology yields improved or comparable results to what is currently considered state-of-the-art when deployed on the RAVDESS and CREMA-D speech corpora.
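The feature-extraction flavor of transfer learning on spectrogram images can be sketched with TensorFlow/Keras as follows; the frozen VGG-16 base and eight-class head are illustrative, and the random arrays stand in for actual spectrogram batches.

```python
# Feature extraction with an ImageNet-pre-trained VGG-16 base and a small emotion classifier head.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                                  # feature extraction: freeze ImageNet weights

x = GlobalAveragePooling2D()(base.output)
out = Dense(8, activation="softmax")(x)                 # e.g. eight emotion classes
model = Model(base.input, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

spectrograms = np.random.rand(4, 224, 224, 3).astype("float32")   # toy spectrogram batch
labels = np.array([0, 3, 5, 7])
model.fit(spectrograms, labels, epochs=1, verbose=0)
```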
Full Text