Study: Many Archived Packages Return to CRAN

CRAN packages are archived all the time, but a large portion of them eventually get fixed and return to CRAN. Using public data available from different resources[1] on CRAN, we have found that 36% of the archived packages get unarchived at some point (Revilla 2022). The median time for these packages to return to CRAN is ~33 days.

Data quality checks

To make sure our assumptions about the raw input data are valid, we will run some initial quality checks based on the data available as of 2024-07-06.

A package should never be unarchived more times than it is archived. However, there are currently 94 packages that were unarchived more times than they were archived. There could be several reasons for this:

  • one of the previous issues with the package led to its removal.
  • the term used for annotating an event was different (‘orphaned’ instead of ‘archived’).
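The check above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event log is a hypothetical list of (package, date, action) tuples with made-up package names, not the actual CRAN data, and `over_unarchived` is a name introduced here for the example.

```python
from collections import Counter

# Hypothetical event log: (package, date, action). All names and dates are made up.
events = [
    ("pkgA", "2020-01-10", "archived"),
    ("pkgA", "2020-02-01", "unarchived"),
    ("pkgB", "2019-05-03", "orphaned"),    # annotated as 'orphaned', not 'archived'
    ("pkgB", "2019-06-01", "unarchived"),
    ("pkgC", "2021-03-15", "archived"),
]

def over_unarchived(events):
    """Return packages with more 'unarchived' than 'archived' events."""
    archived = Counter(p for p, _, a in events if a == "archived")
    unarchived = Counter(p for p, _, a in events if a == "unarchived")
    # A Counter returns 0 for packages that were never recorded as archived.
    return sorted(p for p, n in unarchived.items() if n > archived[p])

print(over_unarchived(events))
```

Here pkgB is flagged because its archival was recorded under a different term, which is the second explanation listed above.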
First action  Packages
orphaned            86
removed             68
archived            21
renamed              1

Figure: First recorded action for a package. Ordered by date, the first recorded action for most packages is being added to CRAN; for some it is not.

We also check the first recorded event of each package. If the first action recorded for a package is not ‘accepted’, this can indicate problems in the data that could distort the conclusions.
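A minimal Python sketch of this check, assuming a hypothetical event log of (package, ISO date, action) tuples; the function names are introduced here for illustration only.

```python
from collections import Counter

# Hypothetical event log: (package, ISO date, action). All values are made up.
events = [
    ("pkgA", "2018-04-01", "accepted"),
    ("pkgA", "2020-01-10", "archived"),
    ("pkgB", "2019-05-03", "orphaned"),
    ("pkgB", "2019-06-01", "accepted"),
    ("pkgC", "2021-03-15", "removed"),
]

def first_actions(events):
    """Map each package to its earliest recorded action."""
    first = {}
    # ISO dates sort correctly as strings; sorted() is stable, so events on
    # the same date keep their input order.
    for package, _, action in sorted(events, key=lambda e: e[1]):
        first.setdefault(package, action)
    return first

def tally_unexpected(events):
    """Count first actions that are anything other than 'accepted'."""
    return Counter(a for a in first_actions(events).values() if a != "accepted")

print(tally_unexpected(events))
```

Note that when several events share the earliest date, the result depends on their order in the input, so ties would need a more careful rule in a real analysis.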

The ‘removed’ action deserves a special mention: it is usually reserved for copyright issues, and it is normal for it to be the first action on record for a package, because previous records (the package source code) are removed from CRAN too.

By contrast, we should not expect packages to lack an ‘accepted’ record on CRAN. Based on the current data, this is the case for 135 packages. This could indicate that packages have been ‘renamed’ or ‘removed’. Another explanation could be that a dialogue between the package maintainers and the CRAN Team led to the package being ‘unarchived’ without a new ‘accepted’ package. It could also be due to a missing entry in the CRAN data.

Table: Events per package on the same date. Most packages have just one action per day. Unarchiving usually requires a new version of the package to be accepted the same day.
Multiple actions  Events  % events
1                  60011       86%
2                   9101       13%
3                    331        0%

In total there are 425 ‘unarchived’ events without a corresponding ‘accepted’ event on the same date. Currently, this is the case for 413 packages out of 9284 (4%).
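The same-date check can be sketched in Python as follows. The event log is hypothetical, and `unarchived_without_accept` is a name made up for this example; only the final percentage uses the counts reported above.

```python
# Hypothetical event log: (package, ISO date, action). All values are made up.
events = [
    ("pkgA", "2020-02-01", "accepted"),
    ("pkgA", "2020-02-01", "unarchived"),
    ("pkgB", "2019-06-01", "unarchived"),   # no 'accepted' on the same date
    ("pkgB", "2019-06-03", "accepted"),
]

def unarchived_without_accept(events):
    """Return (package, date) pairs unarchived without a same-day 'accepted' event."""
    accepted = {(p, d) for p, d, a in events if a == "accepted"}
    return [(p, d) for p, d, a in events
            if a == "unarchived" and (p, d) not in accepted]

print(unarchived_without_accept(events))

# Share of affected packages, using the counts reported in the text.
share = round(100 * 413 / 9284)
print(f"{share}%")
```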

Conversely, there are some events that are not expected to happen on the same day:

Table: Actions that happened on the same date for a given package. Mostly, a new acceptance leads to a package being unarchived. On some occasions other actions coincide.
Multiple actions                   Events        %
accepted & unarchived                4344  93.460%
accepted & orphaned                   121   2.603%
accepted & archived & unarchived      109   2.345%
accepted & archived                    63   1.355%
archived & unarchived                   7   0.151%
accepted & removed                      3   0.065%
accepted & orphaned & unarchived        1   0.022%

Days with three different actions imply that there have been multiple revisions by the CRAN Team on the same day.
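A tally like the one in the table can be produced by grouping actions per package and date. The Python sketch below assumes a hypothetical event log and a made-up helper name; it is not the code used for the analysis.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (package, ISO date, action). All values are made up.
events = [
    ("pkgA", "2020-02-01", "unarchived"),
    ("pkgA", "2020-02-01", "accepted"),
    ("pkgB", "2021-07-09", "archived"),
    ("pkgB", "2021-07-09", "unarchived"),
    ("pkgB", "2021-07-09", "accepted"),
    ("pkgC", "2022-01-05", "accepted"),     # single action, not a combination
]

def same_day_combinations(events):
    """Count combinations of distinct actions sharing a package and a date."""
    by_day = defaultdict(set)
    for package, day, action in events:
        by_day[(package, day)].add(action)
    # Sort the action names so each combination gets one canonical label.
    return Counter(" & ".join(sorted(actions))
                   for actions in by_day.values() if len(actions) > 1)

for combo, n in same_day_combinations(events).items():
    print(combo, n)
```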

In total, 622 of the 9284 packages were identified as having problematic records or processing. Of these, 118 were found to have two or more different issues. Depending on the issue and the question we are trying to answer, these records might be corrected to the best of our abilities or simply discarded.

Analysis

Now that we have looked into the data quality, we can start trying to answer some questions:

Summary of how long it takes packages to be unarchived

Table: Summary statistics of time to get back to CRAN after being archived. Median time is 33 days.
Times archived  Packages  Min.    1st Qu.  Median   Mean      3rd Qu.   Max.
1               3361      1 days  9 days   33 days  128 days  125 days  3292 days
2               662       1 days  8 days   28 days  92 days   90 days   1949 days
3               175       1 days  7 days   27 days  92 days   86 days   882 days
4               59        1 days  5 days   18 days  73 days   82 days   652 days
5               16        1 days  3 days   13 days  28 days   48 days   93 days
6               5         3 days  17 days  27 days  73 days   63 days   254 days
7               1         2 days  2 days   2 days   2 days    2 days    2 days

Figure: Summary statistics of time to get back to CRAN after being archived. Graphical representation of the previous table.
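As a rough illustration of how such return times can be derived, the Python sketch below pairs each ‘archived’ event with the next ‘unarchived’ event of the same package. The event log and the function name are hypothetical, and the pairing rule is an assumption about how the summary was computed.

```python
from datetime import date
from statistics import median

# Hypothetical event log: (package, date, action). All values are made up.
events = [
    ("pkgA", date(2020, 1, 10), "archived"),
    ("pkgA", date(2020, 2, 12), "unarchived"),
    ("pkgB", date(2021, 3, 1), "archived"),
    ("pkgB", date(2021, 3, 10), "unarchived"),
    ("pkgA", date(2021, 6, 1), "archived"),
    ("pkgA", date(2021, 6, 29), "unarchived"),
    ("pkgC", date(2022, 5, 1), "archived"),   # never returned
]

def return_times(events):
    """Days between each 'archived' event and the next 'unarchived' one."""
    waits = []
    open_archive = {}  # package -> date of the archival still awaiting a return
    for package, day, action in sorted(events, key=lambda e: e[1]):
        if action == "archived":
            open_archive[package] = day
        elif action == "unarchived" and package in open_archive:
            waits.append((package, (day - open_archive.pop(package)).days))
    return waits

waits = return_times(events)
print(waits)
print(median(days for _, days in waits))
```

Packages that never return (pkgC here) contribute no waiting time, so the summary only describes packages that did come back.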

Return time for packages archived only once in their lifetime

Figure: Empirical distribution of the time it takes packages to get unarchived as a function of number of days since being archived on CRAN for the first time.

Return time for packages archived

Figure: Empirical distribution of the time it takes packages to get unarchived as a function of number of days since being archived on CRAN.

Figure: Packages archived and date since the previous release. The color indicates if a given package was archived multiple times; the more times it has been archived the lighter the point is.

Cumulative number of archived packages over the years

Figure: Package actions taken by the CRAN Team over time. The CRAN Team may take different actions for packages currently on CRAN, e.g. archived (solid red), orphaned (dotted yellow), removed (dashed green), renamed (dashed blue), and unarchived (dotted purple). Presented is the cumulative number of such events over time on the linear (left) and the logarithmic (right) scale.

Days to return versus date when archived

Figure: Packages being archived and returning to CRAN. Each data point represents when a CRAN package was archived (horizontal axis) and when it was unarchived (vertical axis). If more than one package was archived and unarchived on the same dates, the corresponding data point is presented as a larger disk. The gray dashed line is the event horizon.

Distribution of number of days for packages to return to CRAN

Figure: Histogram of how long packages remain archived on CRAN. Each bar represents a week. Most packages return to CRAN within a month.

Packages archived overall

At least 9089 packages have been archived from CRAN, out of a total of 23058 in its whole history. That is, 39% of all packages ever on CRAN were archived at some point.

Figure: Packages are archived multiple times. Packages archived are sometimes back on CRAN and archived again.

Most packages are never archived, and those that are archived are mostly archived only once. This is probably because 50% of those archived never get back to CRAN.

Figure: Most packages are not back to CRAN after being archived. Packages that got archived sometimes go back on CRAN.

Approximately 38% of packages get back on CRAN.

Packages resubmission

Figure: Resubmission process by date of being archived. Events in black are those for which we cannot tell whether the package was not resubmitted or was rejected. The dashed gray lines are the dates of R minor releases.

Notice how there are some packages that submitted a new version to CRAN after 2020-09-12, but were archived long before (those that are in red before that date). Those maintainers might need help to get their packages back on CRAN.

Figure: Submissions are slightly faster for those that are accepted. Percentage of packages that are submitted before a given time.

Slowdown due to resubmission

Back to CRAN?      Events  %        % submitted
Accepted           2558    48.696%  95%
Never resubmitted  2557    48.677%  NA
Not accepted       138     2.627%   5%

Based on the latest data available, which is NA of the archived packages, slightly more than half of the packages try to get back to CRAN. Of those that try, almost all eventually get back to CRAN.
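The percentages in the table follow directly from the event counts. The short Python check below reproduces them; the variable names are made up for the example, and the counts are the ones reported in the table.

```python
# Event counts taken from the table above.
outcomes = {"Accepted": 2558, "Never resubmitted": 2557, "Not accepted": 138}

total = sum(outcomes.values())
share = {k: round(100 * v / total, 3) for k, v in outcomes.items()}

# Among packages that did resubmit, how many were accepted?
submitted = outcomes["Accepted"] + outcomes["Not accepted"]
accepted_of_submitted = round(100 * outcomes["Accepted"] / submitted)
tried = round(100 * submitted / total, 1)

print(share)
print(accepted_of_submitted, tried)
```

Under these counts, about 51.3% of the events correspond to packages that tried to return, and 95% of those attempts were accepted.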

But how fast is the process of getting back on CRAN?

Figure: Resubmission delay: % of time spent in review after submitting a new version till it is accepted. Only includes packages that were accepted one day later than submitted (to avoid dividing by 0).

Most packages that get back to CRAN are accepted soon after submitting the new version that fixes the problems that led to archival. But some (~21%) spend most of the time trying to pass checks to comply with CRAN policies: maybe trying to fix issues detected on submission or waiting for feedback from the CRAN maintainers.
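One possible way to express this delay is sketched below in Python. The definition is an assumption made for illustration: the share of the archived period (from archival to acceptance) that elapsed between submission and acceptance. The exact formula used for the figure is not stated in the text, and `review_share` is a hypothetical name.

```python
from datetime import date

def review_share(archived, submitted, accepted):
    """Fraction of the archived period spent between submission and acceptance.

    Assumed definition: (accepted - submitted) / (accepted - archived).
    """
    total = (accepted - archived).days
    if total <= 0:
        raise ValueError("acceptance must come after archival")
    return (accepted - submitted).days / total

# Made-up dates: submitted late in the archived period, accepted 10 days later.
quick = review_share(date(2023, 1, 1), date(2023, 1, 21), date(2023, 1, 31))
# Made-up dates: submitted early, then most of the time spent in review.
slow = review_share(date(2023, 1, 1), date(2023, 1, 5), date(2023, 1, 31))

print(round(quick, 2), round(slow, 2))
```

With this definition, a package counts as spending most of its time in review when the value exceeds 0.5, as in the second example.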

Packages not addressed in time

Packages are archived because their problems are not addressed or corrected in time. If we look in more detail at these packages, we see there are 5150 packages that failed to correct issues in time. Most of them were archived once, but some of them (1007) were archived multiple times.

Figure: Packages archived because problems were not fixed on time are mostly back. Packages that got archived because maintainers couldn’t fix the packages on time got back on time.

Table 1: Tally of reasons of packages not fixed. Many packages are not fixed and the maintainer address.
Not fixed  Dependencies  Maintainer address  Policy violation  Packages
TRUE       FALSE         FALSE               FALSE             5808
FALSE      FALSE         FALSE               FALSE             1458
FALSE      TRUE          FALSE               FALSE             1270
FALSE      FALSE         FALSE               TRUE              644
FALSE      FALSE         TRUE                FALSE             228
TRUE       FALSE         TRUE                FALSE             18
TRUE       TRUE          FALSE               FALSE             6
FALSE      TRUE          FALSE               TRUE              4
TRUE       FALSE         FALSE               TRUE              4
FALSE      TRUE          TRUE                FALSE             1

The first cause of archiving is packages not being fixed. The second cause is not clear, as it seems to be a mix of circumstances and difficulties parsing the cause. The third cause is that a package it depends on was archived, and the fourth is that the package didn’t comply with CRAN’s policy. The fifth most common reason is that the email address of the maintainer fails to receive emails.
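This ranking can be checked by summing the table rows per reason. The Python sketch below copies the counts from Table 1; the label "no reason parsed" for rows where every flag is FALSE is a name introduced here, and a package with two flags is counted under both reasons.

```python
# Rows from Table 1: (not fixed, dependencies, maintainer address, policy violation), packages
rows = [
    ((True, False, False, False), 5808),
    ((False, False, False, False), 1458),
    ((False, True, False, False), 1270),
    ((False, False, False, True), 644),
    ((False, False, True, False), 228),
    ((True, False, True, False), 18),
    ((True, True, False, False), 6),
    ((False, True, False, True), 4),
    ((True, False, False, True), 4),
    ((False, True, True, False), 1),
]
reasons = ("not fixed", "dependencies", "maintainer address", "policy violation")

totals = {name: 0 for name in reasons}
totals["no reason parsed"] = 0
for flags, packages in rows:
    if not any(flags):
        totals["no reason parsed"] += packages
    for name, flag in zip(reasons, flags):
        if flag:
            totals[name] += packages

ranking = sorted(totals, key=totals.get, reverse=True)
for name in ranking:
    print(name, totals[name])
```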

Linked to R-releases?

Figure: More packages are archived the closer the release date is. On the left side packages archived long before the next R minor release, on the right side packages archived closer to the R minor release, colored by their submission process after being archived.

Figure: Packages are usually archived weeks after a release. On the left side packages archived after an R minor release, on the right side packages archived long after an R minor release, colored by their submission process after being archived.

Figure: Packages tend to be archived right before a release or after it but not in the middle of releases. On the left side packages archived after an R minor release, on the right side packages archived before the next R minor release, colored by their submission process after being archived.

Only in 2022 and 2023 has there been a clear trend, with packages being archived closer to the next release and after a release, respectively:

Figure: Trend by year of packages being archived. The further to the left, the sooner after a release they were archived; the further to the right, the closer to the next R release.
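To place an archival date relative to R releases, one can compute the days since the previous release and the days until the next one. The Python sketch below is illustrative only: the release dates are example inputs that should be replaced with a verified list, and `release_distance` is a hypothetical helper.

```python
from bisect import bisect_right
from datetime import date

# Example release dates used as illustrative input; verify before real use.
releases = [date(2022, 4, 22), date(2023, 4, 21), date(2024, 4, 24)]

def release_distance(day, releases):
    """Return (days since previous release, days until next release).

    `releases` must be sorted; a side is None when no such release is listed.
    """
    i = bisect_right(releases, day)  # number of releases on or before `day`
    since_prev = (day - releases[i - 1]).days if i > 0 else None
    until_next = (releases[i] - day).days if i < len(releases) else None
    return since_prev, until_next

print(release_distance(date(2023, 4, 1), releases))
print(release_distance(date(2021, 1, 1), releases))
```

A package archived on 2023-04-01 would sit far from the previous listed release and close to the next one, which places it on the right side of the figures above.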

Age of archival

Figure: Time since being accepted to being archived. Most packages are archived soon after being accepted. There are packages that keep the initial version for several years without problems.

Figure: Time since being accepted for the first time to being archived. Most packages are archived soon after being accepted for the first time.

Packages are often archived soon after being first accepted. There is a peak of archived packages 2 weeks after acceptance, but there are also packages archived before the usual 2-week period to fix issues. Passing the first month seems critical for packages, as the rates later on seem more stable.
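The age at first archival can be derived from the same kind of event log. The following Python sketch uses hypothetical packages and dates, and the function name is made up for the example.

```python
from datetime import date

# Hypothetical event log: (package, date, action). All values are made up.
events = [
    ("pkgA", date(2023, 3, 1), "accepted"),
    ("pkgA", date(2023, 3, 15), "archived"),   # archived 2 weeks after acceptance
    ("pkgB", date(2015, 6, 1), "accepted"),
    ("pkgB", date(2016, 2, 1), "accepted"),    # later version, ignored here
    ("pkgB", date(2020, 6, 1), "archived"),
]

def days_to_first_archival(events):
    """Days from the first 'accepted' event to the first 'archived' event."""
    first_accept, first_archive = {}, {}
    for package, day, action in sorted(events, key=lambda e: e[1]):
        if action == "accepted":
            first_accept.setdefault(package, day)
        elif action == "archived":
            first_archive.setdefault(package, day)
    return {p: (first_archive[p] - first_accept[p]).days
            for p in first_archive if p in first_accept}

ages = days_to_first_archival(events)
within_month = [p for p, d in ages.items() if d <= 30]
print(ages, within_month)
```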

Figure: Time since last version till the package is archived. Most packages are archived shortly after a release, which might indicate that problems are only found after being accepted via additional checks not available to maintainers.

Packages that were already on CRAN are archived sooner after a new release. This matches the trend for first-time accepted packages. If anything, the tendency to archive packages soon after being accepted is stronger.

Figure: Age of packages being archived. Older packages are increasingly being archived but not at the same rate as time passes.

Age of packages archived is increasing, sometimes changes in TODO

Figure: Time since previous releases of packages being archived. Almost a constant rate except at a few specific moments.

Archived packages keep updating at an almost fixed rate.

Archived because depends on other packages

When a package is going to be archived, CRAN sends an email to the maintainer of the package in trouble and to the maintainers of all the packages that depend on it. This often results in people stepping up and fixing the package. When this doesn’t happen, the dependent packages will be archived together with their dependency.

Table: Impact of dependencies. Most archived packages result in another package being archived.
Affected packages Times
1 439
2 148
3 81
4 42
5 34
6 24
7 12
8 1
9 4
11 3
12 2
14 2
17 6
20 2
22 2

The packages that affected the most packages led to 18 packages being archived.
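A tally like the one in the table can be built by counting, for each archived dependency, how many packages were archived because of it. The Python sketch below assumes a hypothetical mapping from each affected package to the dependency that caused its archival; neither the mapping nor the function name comes from the original analysis.

```python
from collections import Counter

# Hypothetical mapping: archived package -> dependency whose archival caused it.
archived_because_of = {
    "pkgB": "pkgA",
    "pkgC": "pkgA",
    "pkgE": "pkgD",
}

def affected_tally(archived_because_of):
    """Map 'number of affected packages' to 'how many dependencies caused that many'."""
    per_dependency = Counter(archived_because_of.values())
    return Counter(per_dependency.values())

tally = affected_tally(archived_because_of)
for affected in sorted(tally):
    print(affected, tally[affected])
```

In this toy input, one dependency took a single package with it and another took two, which corresponds to one entry in each of the first two rows of such a table.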

Figure: Archived packages due to a dependency often come back. In absolute numbers (left) and in percentage (right). Many packages that are archived result in another package being archived, which are usually back to CRAN.

Those packages that were archived were mostly back on CRAN.

Maintainers

Failing email

Sometimes the problem is with the maintainer’s email.

Figure: Packages with an unresponsive maintainer address are archived later after the last version. The line indicates the approximation of these two variables.

The time between the failing email and being archived is increasing, which seems to indicate that maintainers are now more careful with the email address they provide.

References

Revilla, Lluís. 2022. “Reasons Why Packages Are Archived on CRAN.” Personal blog. https://llrs.dev/post/2021/12/07/reasons-cran-archivals/.

Footnotes

  1. Data sources used are tools:::CRAN_current_db(), tools:::CRAN_archive_db(), and PACKAGES.in. The first holds a data frame of packages currently on CRAN, with information on the package name, the package version, and the publishing timestamp. The second holds a list of data frames, each comprising the same package information for all versions ever published on CRAN, except the currently available version. The third holds information on events for packages that have ever been archived, removed, orphaned, etc.