Study: Many Archived Packages Return to CRAN

CRAN packages are archived all the time, but a large portion of them eventually get fixed and return to CRAN. Using public data available from different resources on CRAN (see footnote 1), we found that 36% of the archived packages get unarchived at some point (Revilla 2022). The median time for these packages to return to CRAN is about 33 days.

Data quality checks

First, we check the data so that we can be confident in the conclusions and inferences based on it.

There are 96 packages that were unarchived more times than they were archived. This could be for several reasons:

  • one of the previous issues with the package led to its removal.
  • the term used for annotating an event was different (orphaned instead of archived).
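A count like the 96 above can be obtained by comparing, for each package, the number of unarchived events with the number of archived events. The following is a minimal Python sketch under stated assumptions, not the code used for this study: the (package, action) pairs and the package names are hypothetical.

```python
from collections import Counter

# Hypothetical event log of (package, action) pairs; not real CRAN data.
events = [
    ("pkgA", "archived"), ("pkgA", "unarchived"),
    ("pkgB", "orphaned"), ("pkgB", "unarchived"),
    ("pkgC", "archived"),
]

def more_unarchived_than_archived(events):
    """Return the packages with more 'unarchived' than 'archived' events."""
    archived = Counter(pkg for pkg, action in events if action == "archived")
    unarchived = Counter(pkg for pkg, action in events if action == "unarchived")
    # A Counter returns 0 for packages never recorded as archived.
    return sorted(pkg for pkg in unarchived if unarchived[pkg] > archived[pkg])

suspicious = more_unarchived_than_archived(events)
```

In this toy log only pkgB is flagged, because its archival was annotated as orphaned, which is the second situation in the list.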

We also check the first record that CRAN has of each package.

First action | Packages
orphaned | 85
removed | 68
archived | 21
renamed | 1

First recorded action for a package. By date, the first recorded action for most packages is being added to CRAN. For some it isn’t.

If the first action recorded for a package is not its inclusion, this can indicate problems in the data that could undermine the conclusions.
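This check amounts to sorting each package's events by date and looking at the earliest one. A minimal Python sketch of the idea, assuming a hypothetical log of (package, date, action) records with ISO dates; the names and dates are invented for illustration:

```python
from collections import Counter

# Hypothetical event log of (package, ISO date, action); not real CRAN data.
events = [
    ("pkgA", "2015-03-01", "accepted"),
    ("pkgA", "2019-06-10", "archived"),
    ("pkgB", "2012-01-05", "orphaned"),
    ("pkgB", "2013-02-01", "accepted"),
    ("pkgC", "2018-07-07", "removed"),
]

def first_actions(events):
    """Map each package to the action of its earliest recorded event."""
    first = {}
    # ISO dates sort chronologically as plain strings.
    for package, _date, action in sorted(events, key=lambda e: e[1]):
        first.setdefault(package, action)  # keep only the earliest action
    return first

def tally_unexpected(events, expected="accepted"):
    """Count first actions that are not the expected acceptance."""
    return Counter(a for a in first_actions(events).values() if a != expected)

unexpected = tally_unexpected(events)
```

With this toy log, pkgB and pkgC would be flagged for review because their history does not start with an acceptance.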

The removed action deserves a special mention: it is usually reserved for copyright issues, and it is normal for it to be the first action on record for a package, because previous records (the package source code) are removed from CRAN too.

By contrast, it isn’t expected that CRAN has no record of a package being accepted. This could mean that the package was renamed or that it was removed. This happens for 134 packages.

Other possible explanations: maintainers might have had an exchange with the CRAN volunteers that led to the package being unarchived without a new version being accepted. It could also reflect a missing entry in the CRAN archives.

Events per package on the same date. Most packages have just one action per day. Unarchiving usually requires a new version of the package to be accepted the same day.
Actions on the same day | Events | % events
1 | 59639 | 87%
2 | 8971 | 13%
3 | 331 | <1%

In total there are 423 unarchived events that don’t have a corresponding accepted event on the same date. This affects 411 packages out of 9242 (4%).
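Finding these unmatched events is a lookup of each unarchived event against the set of (package, date) pairs that have an accepted event. A minimal Python sketch, assuming hypothetical (package, date, action) records rather than the real CRAN data:

```python
# Hypothetical (package, ISO date, action) records; not real CRAN data.
events = [
    ("pkgA", "2021-05-01", "unarchived"),
    ("pkgA", "2021-05-01", "accepted"),
    ("pkgB", "2021-06-15", "unarchived"),
    ("pkgB", "2021-06-20", "accepted"),
]

def unarchived_without_acceptance(events):
    """Return (package, date) of unarchived events with no same-day acceptance."""
    accepted = {(pkg, day) for pkg, day, action in events if action == "accepted"}
    return [
        (pkg, day)
        for pkg, day, action in events
        if action == "unarchived" and (pkg, day) not in accepted
    ]

unmatched = unarchived_without_acceptance(events)
affected_packages = {pkg for pkg, _day in unmatched}
```

Counting `unmatched` gives the number of events and counting `affected_packages` gives the number of packages, which is why the two figures in the text differ.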

Conversely, some events are not expected to happen on the same day:

Actions that happened on the same date for a given package. Mostly, a new acceptance leads to a package being unarchived. On some occasions other actions occur together.
Multiple actions | Events | %
accepted & unarchived | 4279 | 93.367%
accepted & orphaned | 121 | 2.640%
accepted & archived & unarchived | 109 | 2.378%
accepted & archived | 63 | 1.375%
archived & unarchived | 7 | 0.153%
accepted & removed | 3 | 0.065%
accepted & orphaned & unarchived | 1 | 0.022%

Those with 3 different actions imply multiple revisions by the CRAN team on the same day.
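A tally of same-day combinations can be built by grouping actions by package and date and then counting each distinct set of actions. A minimal Python sketch with hypothetical records; the " & " label format simply mirrors the table and is an assumption about how the labels were built:

```python
from collections import Counter, defaultdict

# Hypothetical (package, ISO date, action) records; not real CRAN data.
events = [
    ("pkgA", "2021-05-01", "unarchived"),
    ("pkgA", "2021-05-01", "accepted"),
    ("pkgB", "2021-06-15", "archived"),
    ("pkgB", "2021-06-15", "unarchived"),
    ("pkgB", "2021-06-15", "accepted"),
    ("pkgC", "2021-07-01", "archived"),
]

def same_day_combinations(events):
    """Count combinations of distinct actions on the same package and date."""
    by_day = defaultdict(set)
    for pkg, day, action in events:
        by_day[(pkg, day)].add(action)
    # Only keep package-days with more than one distinct action.
    return Counter(
        " & ".join(sorted(actions))
        for actions in by_day.values()
        if len(actions) > 1
    )

combinations = same_day_combinations(events)
```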

In total, 620 different packages were identified with problematic records or processing. Of these, 118 had two or more different issues. Depending on the issue and on the question we are trying to answer, these records are either corrected to the best of our abilities or simply discarded.

Analysis

Now that we have looked into the data quality we can start trying to answer some questions:

Summary of how long it takes packages to be unarchived

Summary statistics of the time (in days) to get back to CRAN after being archived. The median is 33 days or less.
Times archived | Packages | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max.
1 | 3317 | 1 | 9 | 33 | 126 | 125 | 3292
2 | 654 | 1 | 8 | 27 | 93 | 90 | 1949
3 | 169 | 1 | 7 | 25 | 89 | 77 | 882
4 | 55 | 1 | 4 | 17 | 69 | 66 | 652
5 | 16 | 1 | 3 | 13 | 28 | 48 | 93
6 | 5 | 3 | 17 | 27 | 73 | 63 | 254
7 | 1 | 2 | 2 | 2 | 2 | 2 | 2
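The columns of this table are a six-number summary of the return times in each group. The Python sketch below computes such a summary for one group; it assumes the quartiles use linear interpolation (the default of R's `summary()`, which Python's `statistics.quantiles` reproduces with `method="inclusive"`), and the sample values are hypothetical.

```python
from statistics import mean, quantiles

def summarize(days):
    """Six-number summary (min, quartiles, mean, max) of return times in days."""
    days = sorted(days)
    if len(days) == 1:
        # A single observation is its own quartiles (as in the last table row).
        q1 = med = q3 = days[0]
    else:
        # "inclusive" interpolates linearly between the observed values.
        q1, med, q3 = quantiles(days, n=4, method="inclusive")
    return {
        "min": days[0], "q1": q1, "median": med,
        "mean": mean(days), "q3": q3, "max": days[-1],
    }

# Hypothetical return times, in days, for five archival events.
summary = summarize([33, 1, 400, 9, 125])
```

Applying `summarize` separately to the events of packages archived once, twice, and so on would give one row per group.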

Summary statistics of the time to get back to CRAN after being archived. Same as the previous table but in graphical format.

Return time for packages archived only once in their lifetime

Empirical distribution of the time it takes packages to get unarchived as a function of number of days since being archived on CRAN for the first time.

Return time for packages archived

Empirical distribution of the time it takes packages to get unarchived as a function of number of days since being archived on CRAN.
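An empirical distribution of this kind gives, for any number of days, the share of archived packages that had returned by then. A minimal Python sketch, using hypothetical return times rather than the study's data:

```python
from bisect import bisect_right

def ecdf(days):
    """Return a function giving the share of packages back within t days."""
    ordered = sorted(days)
    n = len(ordered)

    def share_returned_by(t):
        # bisect_right counts how many observations are <= t.
        return bisect_right(ordered, t) / n

    return share_returned_by

# Hypothetical return times, in days.
returned_by = ecdf([1, 9, 33, 125, 400])
share_within_33_days = returned_by(33)
```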

Packages archived and date since the previous release. The color indicates if a given package was archived multiple times: the more times it has been archived the lighter the point is.

Cumulative number of archived packages over the years

Package actions taken by the CRAN Team over time. The CRAN Team may take different actions for packages currently on CRAN, e.g. archived (solid red), orphaned (dotted yellow), removed (dashed green), renamed (dashed blue), and unarchived (dotted purple). Presented is the cumulative number of such events over time on the linear (left) and the logarithmic (right) scale.

Days to return versus date when archived

Packages being archived and returning to CRAN. Each data point represents when a CRAN package was archived (horizontal axis) and when it was unarchived (vertical axis). If more than one package was archived and unarchived on the same dates, the corresponding data point is presented as a larger disk. The gray dashed line is the event horizon.

Distribution of number of days for packages to return to CRAN

Histogram of how long packages remain archived on CRAN. Each bar represents a week. Most packages return to CRAN within a month.

Packages archived overall

At least 9047 packages have been archived from CRAN, out of a total of 22901 in its whole history. This means that 40% of all packages ever on CRAN were archived at some point.

Packages are archived multiple times. Packages archived are sometimes back on CRAN and archived again.

Most packages are never archived, but those that are get archived mostly only once. This is probably because 50% of those archived never get back to CRAN.

Most packages are not back to CRAN after being archived. Packages that got archived sometimes go back on CRAN.

Approximately 38% of packages get back on CRAN.

Packages resubmission

Resubmission process by date of being archived. Events in black are those for which we cannot tell whether they were never resubmitted or were rejected.

Notice how some packages submitted a new version to CRAN after 2020-09-12 but were archived long before (those in red before that date). Those maintainers might need help to get their packages back on CRAN.

Submissions are slightly faster for those that are accepted. Percentage of packages that are submitted before a given time.

Slowdown due to resubmission

Back to CRAN? | Events | % | % submitted
Never resubmitted | 2549 | 49.08% | NA
Accepted | 2498 | 48.09% | 94%
Not fit | 147 | 2.83% | 6%

Based on the latest data available, slightly more than half of the archived packages try to get back to CRAN. Of those that try, almost all eventually get back to CRAN.
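The two percentage columns of the table above use different denominators: "%" divides by all archival events, while "% submitted" divides only by the events that were resubmitted. The Python sketch below recomputes both from the counts in the table; the function and variable names are illustrative.

```python
# Event counts taken from the table above.
counts = {"Never resubmitted": 2549, "Accepted": 2498, "Not fit": 147}

def shares(counts, not_submitted="Never resubmitted"):
    """Return (% of all events, % of resubmitted events) for each outcome."""
    total = sum(counts.values())
    submitted = total - counts[not_submitted]
    rows = {}
    for outcome, n in counts.items():
        pct_all = round(100 * n / total, 2)
        # The share among resubmissions is undefined for "never resubmitted".
        pct_submitted = None if outcome == not_submitted else round(100 * n / submitted)
        rows[outcome] = (pct_all, pct_submitted)
    return rows

rows = shares(counts)
```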

But how fast is the process of being back on CRAN?

Resubmission delay: percentage of time spent in review, from submitting a new version until it is accepted. Only includes packages that were accepted at least one day after being submitted (to avoid dividing by 0).

Most packages that get back to CRAN are accepted soon after submitting the new version that fixes the problems that caused the archival. But some (~21%) spend most of the time trying to pass checks to comply with CRAN policies: maybe fixing issues detected on submission or waiting for feedback from the CRAN maintainers.

Packages not addressed in time

Packages are archived because their problems are not addressed or corrected in time. Looking at these packages in more detail, 5128 packages failed to make corrections in time. Most of them were archived once, but some (988) were archived multiple times.

Packages archived because problems were not fixed on time are mostly back. Packages that got archived because maintainers couldn’t fix them on time mostly got back on CRAN.
Table 1: Tally of reasons why packages were archived. Most packages were archived because they were not fixed.
Not fixed | Dependencies | Maintainer address | Policy violation | Packages
TRUE | FALSE | FALSE | FALSE | 5776
FALSE | FALSE | FALSE | FALSE | 1445
FALSE | TRUE | FALSE | FALSE | 1257
FALSE | FALSE | FALSE | TRUE | 639
FALSE | FALSE | TRUE | FALSE | 228
TRUE | FALSE | TRUE | FALSE | 18
TRUE | TRUE | FALSE | FALSE | 6
FALSE | TRUE | FALSE | TRUE | 4
TRUE | FALSE | FALSE | TRUE | 4
FALSE | TRUE | TRUE | FALSE | 1

The first cause of archiving is packages not being fixed. The second cause is not clear, as it seems to be a mix of circumstances and difficulties parsing the cause. The third cause of archiving is a package it depends on being archived, and the fourth is that the package didn’t comply with CRAN’s policy. The fifth most common reason is that the email address of the maintainer fails to receive emails.
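The ranking of causes can be checked by summing the rows of Table 1 for each reason, counting a row with several TRUE flags once per flag. The Python sketch below does this with the counts from the table; the label "unclear" for rows where every flag is FALSE is an illustrative name.

```python
# Rows of Table 1:
# (not fixed, dependencies, maintainer address, policy violation, packages)
table = [
    (True, False, False, False, 5776),
    (False, False, False, False, 1445),
    (False, True, False, False, 1257),
    (False, False, False, True, 639),
    (False, False, True, False, 228),
    (True, False, True, False, 18),
    (True, True, False, False, 6),
    (False, True, False, True, 4),
    (True, False, False, True, 4),
    (False, True, True, False, 1),
]
reasons = ("not fixed", "dependencies", "maintainer address", "policy violation")

def marginal_totals(table):
    """Sum the package counts per reason; rows with no flag count as 'unclear'."""
    totals = dict.fromkeys(reasons, 0)
    totals["unclear"] = 0
    for *flags, count in table:
        if not any(flags):
            totals["unclear"] += count
        for reason, flag in zip(reasons, flags):
            if flag:
                totals[reason] += count
    return totals

totals = marginal_totals(table)
ranking = sorted(totals, key=totals.get, reverse=True)
```

The resulting order is not fixed, unclear, dependencies, policy violation, maintainer address, which matches the ranking described in the text.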

Linked to R-releases?

More packages are archived the closer the release date is.

Packages are usually archived weeks after a release.

Packages tend to be archived right before a release or after it but not in the middle of releases.

Only in 2022 and 2023 has there been a clear trend, with packages archived closer to the next release and after a release, respectively:

Trend by year of packages being archived. The further to the left, the sooner after a release they were archived; the further to the right, the closer before an R release they were archived.
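Placing an archival between two releases requires the days since the previous release and the days until the next one. A minimal Python sketch of that calculation; the release dates below are hypothetical placeholders, not the actual R release calendar.

```python
from bisect import bisect_right
from datetime import date

def position_between_releases(archived_on, releases):
    """Return (days since previous release, days until next release).

    Returns None when the date is not bracketed by two known releases.
    """
    releases = sorted(releases)
    i = bisect_right(releases, archived_on)
    if i == 0 or i == len(releases):
        return None
    previous, following = releases[i - 1], releases[i]
    return (archived_on - previous).days, (following - archived_on).days

# Hypothetical release dates, for illustration only.
releases = [date(2022, 4, 1), date(2023, 4, 1)]
position = position_between_releases(date(2022, 4, 11), releases)
```

A small first value means the package was archived shortly after a release; a small second value means it was archived shortly before the next one.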

Age of archival

Time since being accepted to being archived. Most packages are archived soon after being accepted

Time since last version till the package is archived. Most packages are archived shortly after a release, which might indicate that problems are only found after being accepted via additional checks not available to maintainers.

Age of packages being archived. Older packages are increasingly being archived, but not at the same rate as time passes.

Time since the previous release of packages being archived. Almost a constant rate except at a few specific moments.

Archived because depends on other packages

When a package is going to be archived, CRAN sends an email to the maintainer of the package in trouble and to the maintainers of all the packages that depend on it. This often results in people stepping up and fixing the package. When this doesn’t happen, packages will be archived together with their dependency.
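The worst case of this cascade can be sketched as a walk over reverse dependencies. The Python sketch below assumes that nobody steps in and that archival propagates transitively to every dependent package; both are simplifying assumptions, and the dependency graph is hypothetical.

```python
from collections import deque

def cascade(archived, reverse_deps):
    """Return every package that would follow `archived` off CRAN.

    `reverse_deps` maps a package to the packages that depend on it.
    Assumes no maintainer fixes anything (worst case).
    """
    affected = set()
    queue = deque([archived])
    while queue:
        current = queue.popleft()
        for dependent in reverse_deps.get(current, ()):
            if dependent != archived and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

# Hypothetical reverse-dependency graph.
reverse_deps = {"base_pkg": ["a", "b"], "a": ["c"], "c": ["a"]}
affected = cascade("base_pkg", reverse_deps)
```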

Impact of dependencies. Most often, an archived package results in one other package being archived.
Affected packages | Times
1 | 438
2 | 145
3 | 81
4 | 42
5 | 33
6 | 24
7 | 12
8 | 3
9 | 2
11 | 3
12 | 2
13 | 2
17 | 6
20 | 2
22 | 2

The packages that affected the most other packages led to 18 packages being archived.

Archived packages due to a dependency often come back. In absolute numbers (left) and in percentage (right). Many packages that are archived result in another package being archived, which are usually back to CRAN.

Those packages that were archived were mostly back on CRAN.

Maintainers

Failing email

Sometimes the problem is with the maintainer’s email.

Packages with a non-responsive maintainer address are archived later after the last version. The line indicates the approximate relationship between these two variables.

The time between the failing email and the package being archived is increasing, which seems to indicate that maintainers are now more careful with the email address they provide.

References

Revilla, Lluís. 2022. “Reasons Why Packages Are Archived on CRAN.” Personal blog. https://llrs.dev/post/2021/12/07/reasons-cran-archivals/.

Footnotes

  1. Data sources used are tools:::CRAN_current_db(), tools:::CRAN_archive_db(), and PACKAGES.in.