Abstract

Comparison of Imputation Methods for Long and Varied Grapevine Phenology Data Sets

Erica Sawyer, Greg Jones, Diego Tomasi, and Luca Brillante*
*Department of Viticulture and Enology, California State University Fresno, 2360 E Barstow Ave, Fresno, CA, 93740 (lucabrillante@csufresno.edu)

Tracking changes in phenological data through time is vital to understanding grapevine development and plant response to climate factors. Handling missing values remains an ongoing challenge in data science, which motivated this project to evaluate several imputation methods on incomplete phenological time series. Imputation is a favorable alternative to removing missing values because the resulting data set is more substantial and retains the dimensions of the original. Furthermore, time-series analysis methods often demand complete series. The data set considered here originated from Conegliano, Italy, and contains phenological data for over 400 grape varieties, recorded at budbreak, flowering, veraison, and ripening since 1964. Depending on the year and phenological stage, missing values can make up between 10% and 70% of the measurements for a given year. This work used three leading imputation methods: k-nearest neighbors (KNN), multivariate imputation by chained equations (MICE), and missForest. Roughly 15% of each data set was held out as a test set, in which original values were replaced with artificial missing values. After imputation, the imputed values for the test-set data points were compared with the original values to assess the accuracy of each method. The imputed data sets with the lowest test root mean squared error (RMSE) for all phenological stages were produced using missForest, with average test-set errors of three days for budbreak, four days for flowering, seven days for veraison, and 10 days for ripening. This work paves the way for future analysis with the imputed Conegliano phenological time series data set.
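
The hold-out evaluation described above can be illustrated with a minimal Python sketch. It is not the study's code: the data are synthetic (hypothetical day-of-year values for 30 varieties over 20 years), the function names are invented for illustration, and the imputer is a simplified nearest-neighbor stand-in rather than the KNN, MICE, or missForest implementations the authors used. It only shows the general procedure: hide roughly 15% of the observed values, impute, and score the imputed values against the hidden originals with root mean squared error.

```python
import math
import random


def mask_test_cells(data, frac=0.15, seed=0):
    """Hide a fraction of the observed cells; return the masked copy and the hidden truth."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(data)
                for j, v in enumerate(row) if v is not None]
    held_out = rng.sample(observed, round(frac * len(observed)))
    masked = [row[:] for row in data]
    truth = {}
    for i, j in held_out:
        truth[(i, j)] = masked[i][j]
        masked[i][j] = None  # artificial missing value
    return masked, truth


def row_distance(a, b):
    """RMS difference over columns observed in both rows (inf if none are shared)."""
    diffs = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum(diffs) / len(diffs)) if diffs else math.inf


def knn_impute(data, k=3):
    """Fill each missing cell with the mean of that column in the k nearest rows."""
    filled = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, v in enumerate(row):
            if v is not None:
                continue
            # Candidate donors: other rows that have column j observed.
            donors = sorted(
                (row_distance(row, other), other[j])
                for m, other in enumerate(data)
                if m != i and other[j] is not None
            )
            nearest = [val for dist, val in donors[:k] if dist < math.inf]
            if not nearest:  # fallback: mean of every observed value in the column
                nearest = [val for _, val in donors]
            if nearest:  # stays None only if the whole column is unobserved
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def rmse(filled, truth):
    """Root mean squared error over the held-out cells that were imputed."""
    errs = [(filled[i][j] - t) ** 2 for (i, j), t in truth.items()
            if filled[i][j] is not None]
    return math.sqrt(sum(errs) / len(errs))


# Synthetic stand-in data (assumption): day of year = 100 + variety effect + year effect + noise,
# with about 30% of the cells missing before any test values are hidden.
rng = random.Random(42)
n_varieties, n_years = 30, 20
variety_offset = [rng.uniform(-10, 10) for _ in range(n_varieties)]
year_offset = [rng.uniform(-7, 7) for _ in range(n_years)]
data = [[100 + variety_offset[i] + year_offset[j] + rng.gauss(0, 2)
         if rng.random() > 0.3 else None
         for j in range(n_years)] for i in range(n_varieties)]

masked, truth = mask_test_cells(data, frac=0.15)
filled = knn_impute(masked, k=3)
error = rmse(filled, truth)
print(f"held-out cells: {len(truth)}, test RMSE: {error:.2f} days")
```

Because the hidden values are known, the same masked data set can be passed to any imputer and the resulting test errors compared directly, which is the basis for ranking methods in this kind of evaluation.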

Funding Support: No funds identified