Abstract:
This study examines the applicability of two families of data reduction methods widely used in Computer Science, Random Projection and Merge and Reduce, for linear regression analysis of big data in Statistics. The Clarkson-Woodruff and Rademacher matrix projections as well as the Merge and Reduce technique are applied to reduce big data sets before a linear regression analysis is performed. The Classical Merge and Reduce approach uses parameter estimates and standard errors as summary statistics, whereas the Bayesian Merge and Reduce approach uses characteristics of the posterior distribution. The study reveals that the techniques considered in this thesis are suitable data reduction techniques for fitting linear regression models to big data sets. The Clarkson-Woodruff method provides faster and more reliable reduced data sets for linear regression analysis. The Merge and Reduce models approximate the true Poisson and linear regression models well, provided there are enough observations per variable per block (5,000 observations per block). However, for data sets with unbalanced factor variables, the Bayesian Merge and Reduce models approximate the true models better than the Classical Merge and Reduce models. The Merge and Reduce models also approximate the true models well when outliers are evenly distributed among blocks, although the standard errors are overestimated for models without intercept terms. For unevenly distributed outliers, the Random Projection methods provide reliable results. Although the methods considered in this thesis are largely used in Computer Science, they enable efficient linear regression analysis of big data sets.
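The two families of techniques named above can be illustrated in a minimal sketch. The code below is not the thesis's implementation; it is a small, self-contained example (simulated data, illustrative sketch and block sizes) of a Clarkson-Woodruff (CountSketch) projection followed by ordinary least squares, and a Classical Merge and Reduce pass that fits per-block OLS estimates and combines them by precision weighting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated regression data (sizes and coefficients are illustrative assumptions)
n, d = 100_000, 5
X = rng.normal(size=(n, d))
beta_true = np.arange(1, d + 1, dtype=float)
y = X @ beta_true + rng.normal(size=n)

def clarkson_woodruff_sketch(X, y, sketch_size):
    """CountSketch: each of the n rows is hashed to one of `sketch_size`
    rows with a random sign; OLS is then fit on the much smaller sketch."""
    rows = rng.integers(0, sketch_size, size=X.shape[0])
    signs = rng.choice([-1.0, 1.0], size=X.shape[0])
    SX = np.zeros((sketch_size, X.shape[1]))
    Sy = np.zeros(sketch_size)
    np.add.at(SX, rows, signs[:, None] * X)   # accumulate signed rows
    np.add.at(Sy, rows, signs * y)
    beta, *_ = np.linalg.lstsq(SX, Sy, rcond=None)
    return beta

def merge_and_reduce(X, y, block_size):
    """Classical Merge and Reduce: fit OLS on each block, keep only the
    per-block summaries (estimate and precision), then combine them."""
    weighted_betas, precisions = [], []
    for start in range(0, X.shape[0], block_size):
        Xb, yb = X[start:start + block_size], y[start:start + block_size]
        beta_b, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
        resid = yb - Xb @ beta_b
        sigma2 = resid @ resid / max(len(yb) - Xb.shape[1], 1)
        prec = (Xb.T @ Xb) / sigma2           # inverse covariance of beta_b
        weighted_betas.append(prec @ beta_b)
        precisions.append(prec)
    return np.linalg.solve(sum(precisions), sum(weighted_betas))

beta_cw = clarkson_woodruff_sketch(X, y, sketch_size=2_000)
beta_mr = merge_and_reduce(X, y, block_size=5_000)
```

Both routines return coefficient estimates close to `beta_true` while working on far less data at once: the sketch compresses 100,000 rows to 2,000 before a single fit, and Merge and Reduce never holds more than one 5,000-row block plus a set of small summaries in memory.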