The goal of butterfly is to aid in the verification of continually updated and overwritten time-series data, where we expect new values over time but want to ensure that previous data remains unchanged.

Unnoticed changes in previous data could have unintended consequences, such as invalidating DOIs, or altering future predictions if used as input in forecasting models.

This package provides functionality that can be used as part of a data pipeline to check for and flag changes to previous data, so that they do not go unnoticed.

Data

This package includes a small dummy dataset, butterflycount, which is a list of monthly dataframes, each holding the butterfly counts recorded up to that month.

library(butterfly)
butterflycount
#> $january
#>         time count
#> 1 2024-01-01    22
#> 2 2023-12-01    55
#> 3 2023-11-01    11
#> 
#> $february
#>         time count
#> 1 2024-02-01    17
#> 2 2024-01-01    22
#> 3 2023-12-01    55
#> 4 2023-11-01    11
#> 
#> $march
#>         time count
#> 1 2024-03-01    23
#> 2 2024-02-01    17
#> 3 2024-01-01    22
#> 4 2023-12-01    55
#> 5 2023-11-01    18
#> 
#> $april
#>         time value species
#> 1 2024-04-01    12 Admiral
#> 2 2024-03-01    23 Admiral
#> 3 2024-02-01    NA Admiral
#> 4 2024-01-01    22 Admiral
#> 5 2023-12-01    55 Admiral
#> 6 2023-11-01    18 Admiral

This dataset is entirely fictional and is included only to help demonstrate butterfly’s functionality.

Examining datasets: loupe()

We can use butterfly::loupe() to examine in detail whether previous values have changed.

butterfly::loupe(
  butterflycount$february,
  butterflycount$january,
  datetime_variable = "time"
)
#> The following rows are new in 'df_current': 
#>         time count
#> 1 2024-02-01    17
#>  And there are no differences with previous data.
#> [1] TRUE

butterfly::loupe(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)
#> The following rows are new in 'df_current': 
#>         time count
#> 1 2024-03-01    23
#> 
#>  The following values have changes from the previous data.
#> old vs new
#>            count
#>   old[1, ]    17
#>   old[2, ]    22
#>   old[3, ]    55
#> - old[4, ]    18
#> + new[4, ]    11
#> 
#> `old$count`: 17 22 55 18
#> `new$count`: 17 22 55 11
#> [1] FALSE

butterfly::loupe() uses dplyr::semi_join() to match the new and old objects on a common unique identifier, which in a time series will be the timestep. waldo::compare() is then used to compare the matched rows and provide a detailed report of the differences.

butterfly follows the waldo philosophy of erring on the side of providing too much information, rather than too little. It gives a detailed feedback message on the differences between the two objects.
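
The matching and comparison described above can be sketched with dplyr and waldo directly. This is a minimal illustration under stated assumptions, not butterfly's actual implementation: the data frames df_previous and df_current are made up for the example, and dplyr::anti_join() is used here as one possible way to isolate new rows.

```r
# Made-up data: one new timestep and one changed value
df_previous <- data.frame(
  time = as.Date(c("2024-02-01", "2024-01-01")),
  count = c(17, 22)
)
df_current <- data.frame(
  time = as.Date(c("2024-03-01", "2024-02-01", "2024-01-01")),
  count = c(23, 17, 25)
)

# Rows whose timestep is absent from the previous data are new
new_rows <- dplyr::anti_join(df_current, df_previous, by = "time")

# Keep only the rows whose timestep is present in both versions
matched_current <- dplyr::semi_join(df_current, df_previous, by = "time")
matched_previous <- dplyr::semi_join(df_previous, df_current, by = "time")

# waldo::compare() returns an empty result when the objects are identical
differences <- waldo::compare(matched_previous, matched_current)
unchanged <- length(differences) == 0

print(new_rows)
print(differences)
```

Comparing only the matched rows means that newly appended timesteps are not reported as differences.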

Extracting unexpected changes: catch()

You might want to return changed rows as a dataframe. For this, butterfly::catch() is provided.

butterfly::catch() only returns rows which have changed from the previous version. It will not return new rows.

df_caught <- butterfly::catch(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)
#> The following rows are new in 'df_current': 
#>         time count
#> 1 2024-03-01    23
#> 
#>  The following values have changes from the previous data.
#> old vs new
#>            count
#>   old[1, ]    17
#>   old[2, ]    22
#>   old[3, ]    55
#> - old[4, ]    18
#> + new[4, ]    11
#> 
#> `old$count`: 17 22 55 18
#> `new$count`: 17 22 55 11
#> 
#>  Only these rows are returned.

df_caught
#>         time count
#> 1 2023-11-01    18
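
To make the idea of a "changed row" concrete, here is a minimal base R sketch that returns the rows of the current data whose timestep exists in the previous data but whose value differs. The helper find_changed_rows() and the example data are hypothetical and only illustrate the concept; butterfly::catch() may work differently internally.

```r
# Hypothetical helper: rows of df_current that differ from df_previous
find_changed_rows <- function(df_current, df_previous,
                              key = "time", value = "count") {
  # Pair up rows that share a timestep; suffixes tell the versions apart
  paired <- merge(df_current, df_previous, by = key,
                  suffixes = c("_current", "_previous"))
  current_vals <- paired[[paste0(value, "_current")]]
  previous_vals <- paired[[paste0(value, "_previous")]]

  # Changed: exactly one value is NA, or both are present and differ
  changed <- (is.na(current_vals) != is.na(previous_vals)) |
    (!is.na(current_vals) & !is.na(previous_vals) &
       current_vals != previous_vals)

  changed_keys <- paired[[key]][changed]
  df_current[df_current[[key]] %in% changed_keys, , drop = FALSE]
}

# Made-up data: 2024-03-01 is new, 2024-01-01 has changed
df_previous <- data.frame(
  time = as.Date(c("2024-02-01", "2024-01-01", "2023-12-01")),
  count = c(17, 22, 55)
)
df_current <- data.frame(
  time = as.Date(c("2024-03-01", "2024-02-01", "2024-01-01", "2023-12-01")),
  count = c(23, 17, 25, 55)
)

changed_rows <- find_changed_rows(df_current, df_previous)
print(changed_rows)
```

New rows never appear in the result because merge() only keeps timesteps present in both data frames.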

Dropping unexpected changes: release()

Conversely, butterfly::release() drops all rows which have changed from the previous version. Note that it retains new rows, as these were expected.

df_released <- butterfly::release(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)
#> The following rows are new in 'df_current': 
#>         time count
#> 1 2024-03-01    23
#> 
#>  The following values have changes from the previous data.
#> old vs new
#>            count
#>   old[1, ]    17
#>   old[2, ]    22
#>   old[3, ]    55
#> - old[4, ]    18
#> + new[4, ]    11
#> 
#> `old$count`: 17 22 55 18
#> `new$count`: 17 22 55 11
#> 
#>  These will be dropped, but new rows are included.

df_released
#>         time count
#> 1 2024-03-01    23
#> 2 2024-02-01    17
#> 3 2024-01-01    22
#> 4 2023-12-01    55

However, you can also exclude new rows by setting the argument include_new to FALSE.

df_release_without_new <- butterfly::release(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time",
  include_new = FALSE
)
#> The following rows are new in 'df_current': 
#>         time count
#> 1 2024-03-01    23
#> 
#>  The following values have changes from the previous data.
#> old vs new
#>            count
#>   old[1, ]    17
#>   old[2, ]    22
#>   old[3, ]    55
#> - old[4, ]    18
#> + new[4, ]    11
#> 
#> `old$count`: 17 22 55 18
#> `new$count`: 17 22 55 11
#> 
#>  These will be dropped, along with new rows.

df_release_without_new
#>         time count
#> 1 2024-02-01    17
#> 2 2024-01-01    22
#> 3 2023-12-01    55
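
The difference between the two modes can be illustrated by labelling every row of the current data as new, unchanged or changed. The following base R sketch uses made-up data and a hypothetical status vector; it is an illustration of the behaviour described above, not the code behind butterfly::release().

```r
# Made-up data: 2024-03-01 is new, 2024-01-01 has changed
df_previous <- data.frame(
  time = as.Date(c("2024-02-01", "2024-01-01", "2023-12-01")),
  count = c(17, 22, 55)
)
df_current <- data.frame(
  time = as.Date(c("2024-03-01", "2024-02-01", "2024-01-01", "2023-12-01")),
  count = c(23, 17, 25, 55)
)

# Position of each current timestep in the previous data (NA if absent)
idx <- match(df_current$time, df_previous$time)
previous_count <- df_previous$count[idx]

# Same value in both versions, treating two NAs as equal
same <- (is.na(df_current$count) & is.na(previous_count)) |
  (!is.na(df_current$count) & !is.na(previous_count) &
     df_current$count == previous_count)

status <- ifelse(is.na(idx), "new", ifelse(same, "unchanged", "changed"))
# status is "new", "unchanged", "changed", "unchanged"

# Comparable to include_new = TRUE: drop changed rows, keep new ones
released_with_new <- df_current[status != "changed", , drop = FALSE]

# Comparable to include_new = FALSE: keep only unchanged rows
released_without_new <- df_current[status == "unchanged", , drop = FALSE]
```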

Using butterfly in a data processing pipeline

If you would like to know more about using butterfly in an operational data processing pipeline, please refer to the article on using butterfly in an operational pipeline.

A note on controlling verbosity

Although verbosity is mostly the purpose of this package, should you wish to silence messages and warnings, you can do so with options(rlib_message_verbosity = "quiet") and options(rlib_warning_verbosity = "quiet").
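
As these options apply to the whole session, you may prefer to set them temporarily. A minimal sketch, relying on the fact that base R's options() returns the previous values so they can be restored afterwards:

```r
# Silence messages and warnings, remembering the previous settings
old_options <- options(
  rlib_message_verbosity = "quiet",
  rlib_warning_verbosity = "quiet"
)

# ... call butterfly::loupe(), butterfly::catch() or butterfly::release() ...

# Restore the previous settings so later code is as verbose as before
options(old_options)
```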

Rationale

There are a lot of other data comparison and QA/QC packages out there, so why butterfly?

This package was originally developed to deal with ERA5’s initial release data, ERA5T. ERA5T data for a month is overwritten with the final ERA5 data two months after the month in question.

Usually ERA5 and ERA5T are identical, but occasionally an issue with input data forces a recalculation (for example for 09/21 - 12/21, and 07/24), meaning previously published data differs from the final product.

When publishing an ERA5-derived dataset and minting a DOI for it, it is possible to continuously append data without invalidating that DOI. However, a recalculation would overwrite previously published data, thereby forcing a new publication and a new DOI to be minted.

We use the functionality in this package in an automated data processing pipeline to detect changes, stop data transfer and notify the user.
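
A pipeline step of this kind could look roughly like the sketch below. The function transfer_if_unchanged() and its transfer argument are hypothetical names introduced for illustration, and the check result is hard-coded here; in practice it could come from butterfly::loupe(), which returns TRUE or FALSE in the examples above.

```r
# Hypothetical pipeline step: only run `transfer` when the check passed
transfer_if_unchanged <- function(unchanged, transfer) {
  if (!isTRUE(unchanged)) {
    # Stopping with an error halts the pipeline and surfaces the problem
    stop("Previous data has changed: transfer stopped, please review.",
         call. = FALSE)
  }
  transfer()
}

# Stand-in for the real transfer step
dummy_transfer <- function() "transferred"

# Check passed: the transfer runs
ok_result <- transfer_if_unchanged(TRUE, dummy_transfer)

# Check failed: the error is caught here so its message can be passed on
failed_result <- tryCatch(
  transfer_if_unchanged(FALSE, dummy_transfer),
  error = function(e) conditionMessage(e)
)
```

Raising an error rather than returning a flag means that a scheduler or calling script stops by default, unless the failure is handled explicitly.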

This package has intentionally been generalised to accommodate other, but similar, use cases. Other examples could include a correction in instrument calibration, compromised data transfer or unnoticed changes in the parameterisation of a model.