Large-scale environmental data science with ExaGeoStatR

Sameh Abdulah, Yuxiao Li, Jian Cao, Hatem Ltaief, David E. Keyes, Marc G. Genton*, Ying Sun

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

Parallel computing in exact Gaussian process (GP) calculations becomes necessary to overcome the computational and memory restrictions associated with large-scale environmental data science applications. The exact evaluation of the Gaussian log-likelihood function requires $O(n^2)$ storage and $O(n^3)$ operations, where $n$ is the number of geographical locations. Thus, exactly computing the log-likelihood function with a large number of locations requires exploiting the power of existing parallel computing hardware systems, such as shared-memory systems, possibly equipped with GPUs, and distributed-memory systems, to cope with this computational complexity. In this article, we present ExaGeoStatR, a package for exascale geostatistics in R that supports parallel computation of the exact maximum likelihood function on a wide variety of parallel architectures. Furthermore, the package allows scaling existing GP methods to large spatial/temporal domains. Exact solutions for large geostatistical problems that were previously prohibitive become possible with ExaGeoStatR. Parallelization in ExaGeoStatR depends on breaking down the numerical linear algebra operations in the log-likelihood function into a set of tasks and expressing them in a task-based programming model. The package can be used directly through the R environment on parallel systems without the user needing any C, CUDA, or MPI knowledge. Currently, ExaGeoStatR supports several maximum likelihood computation variants: exact, diagonal super-tile approximation, tile low-rank approximation, and mixed-precision. ExaGeoStatR also provides a tool to simulate large-scale synthetic datasets. These datasets can help assess different implementations of the maximum log-likelihood approximation methods. Herein, we show the implementation details of ExaGeoStatR, analyze its performance on various parallel architectures, and assess its accuracy using synthetic datasets with up to 250K observations.
The experimental analysis covers the exact computation of ExaGeoStatR to demonstrate the parallel capabilities of the package. We provide a hands-on tutorial that analyzes a real sea surface temperature dataset. The performance evaluation involves comparisons with the popular packages geoR, fields, and bigGP for exact Gaussian likelihood evaluation. The approximation methods in ExaGeoStatR are not considered in this article since they were analyzed in previous studies.
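
To make the cost stated in the abstract concrete, the following is a minimal serial sketch of an exact Gaussian log-likelihood evaluation, $\ell(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma(\theta)| - \frac{1}{2} z^\top \Sigma(\theta)^{-1} z$, for a zero-mean process. It is not ExaGeoStatR code and does not use the package's API: the function names are hypothetical, and an exponential covariance (the Matérn family with smoothness 1/2) is assumed purely to keep the example free of Bessel functions. The dense matrix accounts for the $O(n^2)$ storage and the Cholesky factorization for the $O(n^3)$ operations.

```python
import math


def exp_covariance(locs, variance, range_):
    """Dense n x n exponential covariance matrix (assumed kernel): O(n^2) storage."""
    n = len(locs)
    return [[variance * math.exp(-math.dist(locs[i], locs[j]) / range_)
             for j in range(n)] for i in range(n)]


def cholesky(a):
    """Return lower-triangular L with a = L L^T; the nested loops cost O(n^3)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        low[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            acc = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = (a[i][j] - acc) / low[j][j]
    return low


def gaussian_loglik(locs, z, variance, range_):
    """Exact zero-mean Gaussian log-likelihood of observations z at locs."""
    n = len(z)
    low = cholesky(exp_covariance(locs, variance, range_))
    # Forward substitution solves L y = z, so z^T Sigma^{-1} z equals y^T y.
    y = []
    for i in range(n):
        y.append((z[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    # log|Sigma| is twice the sum of the logs of the Cholesky diagonal.
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    quad = sum(v * v for v in y)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


if __name__ == "__main__":
    # Toy data invented for illustration only.
    locs = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.0, 1.0)]
    z = [0.3, -0.1, 0.4, 1.2]
    print(gaussian_loglik(locs, z, variance=1.0, range_=0.5))
```

A maximum likelihood fit would call such a function repeatedly inside an optimizer over the covariance parameters. A task-based implementation of the kind the abstract describes would split the factorization and solve into tile-level tasks instead of running them as one serial loop.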

Original language: English (US)
Article number: e2770
Journal: Environmetrics
Volume: 34
Issue number: 1
State: Published - Feb 2023

Bibliographical note

Publisher Copyright:
© 2022 John Wiley & Sons Ltd.

Keywords

  • environmental application
  • Gaussian process
  • Matérn covariance function
  • maximum likelihood optimization
  • parameter estimation
  • prediction

ASJC Scopus subject areas

  • Statistics and Probability
  • Ecological Modeling
