butterfly: An R package for the verification of continually updating timeseries data where we expect new values, but want to ensure previous data remains unchanged.
23 October 2024
Summary
Previously recorded data could be revised after initial publication for a number of reasons, such as the discovery of an inconsistency or error, a change in methodology or instrument re-calibration. When using other data to generate your own, it is crucial to reference the exact version of the data used, in order to maintain data provenance. Unnoticed changes in previous data could have unintended consequences, such as invalidating a published dataset’s Digital Object Identifier (DOI), or altering future predictions if used as input in forecasting models.
But what if you are not aware of upstream changes to your input data? Monitoring data sources for these changes is not always possible. Here we present butterfly, an R package for the verification of continually updating timeseries data where we expect new values, but want to ensure previous data remains unchanged.
The intention of butterfly is to check for changes in previously published data, and warn the user with a report that contains as much detail as possible. This allows them to stop unintended data transfer, revise their published data, release a new version and communicate the significance of the change to their users.
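As a minimal sketch of this workflow (the data frames below are invented; loupe() and its datetime_variable argument are as described in the package documentation):

```r
library(butterfly)

# A hypothetical, previously published series of daily values
df_previous <- data.frame(
  time = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03")),
  value = c(1.2, 3.4, 5.6)
)

# The same series re-downloaded later: one new row has been appended,
# but one previously published value has silently changed (3.4 -> 3.5)
df_current <- data.frame(
  time = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04")),
  value = c(1.2, 3.5, 5.6, 7.8)
)

# loupe() matches rows on the datetime column and reports whether
# previously published values have changed
loupe(df_current, df_previous, datetime_variable = "time")
```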
Statement of Need
Klump et al. (2021) stress the importance of citing the exact version and extract of the data used, and semantic versioning (Preston-Werner 2013) is widely adopted in research software. When generating a derived data product, a key recommendation in Siddorn et al.’s (2022) report “An Information Management Framework for Environmental Digital Twins (IMFe)” is that data provenance must be maintained, that data quality frameworks are clearly documented for users and available in machine-readable format, and that tools and methods support a FAIR implementation (Wilkinson et al. 2016).
At the British Antarctic Survey (BAS), we developed this package to deal with a very specific issue: quality assurance of continually updated and continually published ERA5-derived data.
At BAS, we frequently use ERA5 (Hersbach et al. 2023) as an input to climate models, for example IceNet, a sea ice prediction system based on deep learning (Andersson et al. 2021), generating ERA5-derived data.
The issue with ERA5 and ERA5T
This package was originally developed to deal with ERA5’s initial release data, ERA5T. ERA5T data for a month is overwritten with the final ERA5 data two months after the month in question.
Usually ERA5 and ERA5T are identical, but occasionally an issue with input data can force a recalculation (as happened, for example, for 09/2021 - 12/2021 and 07/2024), meaning previously published data differ from the final product.
In most cases, this is not an issue. Static data publications are a snapshot in time, associated with a specific paper, as in “Forecasts, neural networks, and results from the paper: ‘Seasonal Arctic sea ice forecasting with probabilistic deep learning’” (Andersson & Hosking 2021), or with a specific time period, as in “Downscaled ERA5 monthly precipitation data using Multi-Fidelity Gaussian Processes between 1980 and 2012 for the Upper Beas and Sutlej Basins, Himalaya” (Tazi 2023). These datasets clearly describe the version and time period of ERA5 from which the data were derived, and will not be amended or updated in the future, even if ERA5 is recalculated.
In our case, however, we want to continually append to ERA5-derived datasets and continually publish them. This is useful when the dataset functions as a data source for an environmental digital twin (Blair and Henrys 2023), or simply as input to an environmental forecasting model which itself runs frequently.
Continually appending and publishing requires strict quality assurance. If a published dataset is only ever appended to, a DOI can be minted for it and remains valid. However, if the previously published data change, the DOI is invalidated. This would happen, for example, if you developed your code to calculate a better measure (more accurate, more precise) of the low pressure region, and wanted to reanalyse the previous data and republish.
One such ERA5-derived dataset which we will (hopefully soon!) publish at BAS is the Amundsen Sea Low Index (ASLI).
What is the Amundsen Sea Low Index?
The Amundsen Sea Low (ASL) is a highly dynamic and mobile climatological low pressure system located in the Pacific sector of the Southern Ocean. In this sector, variability in sea-level pressure is greater than anywhere in the Southern Hemisphere, making it challenging to isolate local fluctuations in the ASL from larger-scale shifts in atmospheric pressure. The position and strength of the ASL are crucial for understanding regional change over West Antarctica (Hosking et al. 2016).
Unexpected changes in models
When publishing an ERA5-derived dataset and minting a DOI for it, it is possible to continually append new data without invalidating that DOI. However, a recalculation would overwrite previously published data, thereby forcing a new publication and a new DOI to be minted.
We use the functionality in this package in an automated data processing pipeline to detect changes, stop data transfer and notify the user.
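A sketch of what this check looks like, assuming (as the package documentation describes) that loupe() returns TRUE when previously published values are unchanged and that catch() returns the changed rows; the file paths and messages here are illustrative:

```r
library(butterfly)

# The previously published data and the freshly processed candidate
df_previous <- read.csv("published/asli_previous.csv")
df_current  <- read.csv("staging/asli_current.csv")

if (!loupe(df_current, df_previous, datetime_variable = "time")) {
  # catch() returns the rows whose previously published values have
  # changed, which we include in the report sent to the user
  changed <- catch(df_current, df_previous, datetime_variable = "time")
  print(changed)
  stop("Previously published values have changed: data transfer halted.")
}
```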
Unexpected changes in data acquisition
Measuring instruments can behave differently after a power failure. For example, during a power failure an internal clock could reset to “1970-01-01”, or to the manufacturing date (say, “2021-01-01”). If we are automatically ingesting and processing these data, it would be great to get a heads-up that a timeseries is no longer continuous in the way we expect it to be. This could have consequences for any calculation happening downstream.
To avoid writing a different version of this check for every instrument, we wrote butterfly::timeline().
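A brief sketch of the clock-reset scenario above (the data frame is invented; the datetime_variable and expected_lag arguments follow the package documentation):

```r
library(butterfly)

# Hypothetical daily instrument record in which the internal clock
# reset to 1970-01-01 after a power failure
df_instrument <- data.frame(
  time = as.Date(c("2024-05-01", "2024-05-02", "1970-01-01")),
  temperature = c(10.1, 10.3, 9.8)
)

# timeline() checks that consecutive timestamps are no further apart
# than expected_lag (here, one day); the clock reset breaks that
# expectation, so the series is reported as not continuous
timeline(df_instrument, datetime_variable = "time", expected_lag = 1)
```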
Variable measurement frequencies
In other cases, a non-continuous timeseries is intentional, for example when there is temporal variability in the measurements taken depending on events. At BAS, we collect data from a penguin weighbridge on Bird Island, South Georgia. This weighbridge measures weight on two different load cells (scales) to determine penguin weight and direction.
You can read about this work in more detail in Afanasyev et al. (2015), but the important point here is that the weighbridge does not collect continuous measurements. When no weight is detected on the load cells, it only samples at 1 Hz, but as soon as any change in weight is detected it starts collecting data at 100 Hz. This is of course intentional, to reduce the sheer volume of data we need to process, but it has another benefit: isolating (or attempting to isolate) individual crossings.
The individual crossings are the most valuable pieces of data, as they allow us to deduce information such as weight, direction (from colony to sea, or sea to colony) and, hopefully, ultimately diet.
In this case, separating distinct but continuous segments of data is required. This is the reasoning behind timeline_group(). This function allows us to split our timeseries into groups of individual crossings.
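A sketch of this grouping, under the assumption that timeline_group() adds a grouping column labelling each continuous stretch of measurements (the weighbridge values below are invented):

```r
library(butterfly)

# Hypothetical load cell record, filtered to samples where weight was
# detected: two separate crossings, sampled at 100 Hz
df_crossings <- data.frame(
  time = as.POSIXct("2024-03-01 12:00:00", tz = "UTC") +
    c(0.00, 0.01, 0.02,      # first crossing
      45.00, 45.01, 45.02),  # second crossing, 45 seconds later
  load_cell_kg = c(3.9, 4.1, 4.0, 5.2, 5.3, 5.1)
)

# timeline_group() labels each continuous stretch of measurements,
# so every crossing ends up in its own group
timeline_group(df_crossings, datetime_variable = "time", expected_lag = 1)
```

Each group can then be summarised separately, for example to estimate the weight and direction of a single crossing.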
References
Afanasyev, V., Buldyrev, S. V., Dunn, M. J., Robst, J., Preston, M., et al. 2015. “Increasing Accuracy: A New Design and Algorithm for Automatically Measuring Weights, Travel Direction and Radio Frequency Identification (RFID) of Penguins.” PLOS ONE 10 (4): e0126292. https://doi.org/10.1371/journal.pone.0126292
Andersson, T., and Hosking, J. 2021. “Forecasts, neural networks, and results from the paper: ‘Seasonal Arctic sea ice forecasting with probabilistic deep learning’” (Version 1.0) [Data set]. NERC EDS UK Polar Data Centre. https://doi.org/10.5285/71820e7d-c628-4e32-969f-464b7efb187c
Andersson, T. R., Hosking, J. S., Pérez-Ortiz, M., et al. 2021. “Seasonal Arctic sea ice forecasting with probabilistic deep learning.” Nature Communications 12: 5124. https://doi.org/10.1038/s41467-021-25257-4
Blair, G. S., and Henrys, P. A. 2023. “The Role of Data Science in Environmental Digital Twins: In Praise of the Arrows.” Environmetrics 34 (January). https://doi.org/10.1002/env.2789
Hersbach, H., Bell, B., Berrisford, P., Biavati, G., Horányi, A., Muñoz Sabater, J., Nicolas, J., Peubey, C., Radu, R., Rozum, I., Schepers, D., Simmons, A., Soci, C., Dee, D., and Thépaut, J.-N. 2023. “ERA5 hourly data on single levels from 1940 to present.” Copernicus Climate Change Service (C3S) Climate Data Store (CDS). https://doi.org/10.24381/cds.adbb2d47
Hosking, J. S., Orr, A., Bracegirdle, T. J., and Turner, J. 2016. “Future circulation changes off West Antarctica: Sensitivity of the Amundsen Sea Low to projected anthropogenic forcing.” Geophysical Research Letters 43: 367–376. https://doi.org/10.1002/2015GL067143
Klump, J., Wyborn, L., Wu, M., Martin, J., Downs, R. R., and Asmi, A. 2021. “Versioning Data Is About More than Revisions: A Conceptual Framework and Proposed Principles.” Data Science Journal 20 (1): 12. https://doi.org/10.5334/dsj-2021-012
Preston-Werner, T. 2013. “Semantic Versioning 2.0.0.” https://semver.org/spec/v2.0.0.html (last accessed 28 October 2024).
Siddorn, J., Blair, G. S., Boot, D., Buck, J. J. H., Kingdon, A., et al. 2022. “An Information Management Framework for Environmental Digital Twins (IMFe).” Zenodo. https://doi.org/10.5281/ZENODO.7004351
Tazi, K. 2023. “Downscaled ERA5 monthly precipitation data using Multi-Fidelity Gaussian Processes between 1980 and 2012 for the Upper Beas and Sutlej Basins, Himalayas” (Version 1.0) [Data set]. NERC EDS UK Polar Data Centre. https://doi.org/10.5285/b2099787-b57c-44ae-bf42-0d46d9ec87cc
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 160018. https://doi.org/10.1038/sdata.2016.18