
Reproducibility for the long-term is hard. Even harder if you don't own the data. I was looking over some R scripts from ~2007/09 that I updated ~2016/17 and found broken links for downloading the MIRCA cropping data. The data sets moved from an FTP server to Zenodo (yay) but they just fail when requested in the script (boo). This isn't anything important or urgent, just my PhD dissertation and I'm tinkering, but it just highlights how fragile our systems are for creating #ReproducibleResearch

@adamhsparks Yeah, long-term preservation of data sets is an underestimated issue, IMHO. We had a hard time finding an example for this paper nature.com/articles/s41597-022

Using #guix, it's possible to cover the “code” side of the computational part (R or Julia or anything else) by relying on softwareheritage.org
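Concretely, that works by pinning the exact Guix revision used for the analysis in a channels.scm file; a minimal sketch (the commit hash below is illustrative, not a real pinned revision):

```scheme
;; channels.scm — pin the exact Guix revision used for the analysis.
;; If an upstream source tarball later disappears, Guix can fall back
;; to the archived copy on softwareheritage.org.
(list (channel
       (name 'guix)
       (url "https://git.savannah.gnu.org/git/guix.git")
       (commit "0123456789abcdef0123456789abcdef01234567")))  ; illustrative
```

Years later, `guix time-machine -C channels.scm -- shell r` should rebuild the same R environment from that pinned revision.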

However, data sets…

Nature: Toward practical transparent verifiable and long-term reproducible research using Guix (Scientific Data)
Reproducibility crisis urge scientists to promote transparency which allows peers to draw same conclusions after performing identical steps from hypothesis to results. Growing resources are developed to open the access to methods, data and source codes. Still, the computational environment, an interface between data and source code running analyses, is not addressed. Environments are usually described with software and library names associated with version labels or provided as an opaque container image. This is not enough to describe the complexity of the dependencies on which they rely to operate on. We describe this issue and illustrate how open tools like Guix can be used by any scientist to share their environment and allow peers to reproduce it. Some steps of research might not be fully reproducible, but at least, transparency for computation is technically addressable. These tools should be considered by scientists willing to promote transparency and open science.

@zimoun @adamhsparks

In my opinion, Guix is overkill for anything except computer science research. I think an approach that has better chances to succeed is reusing the OCI ecosystem, in particular using it for per-file deduplication, which is important for large datasets. The missing piece for me is a package manager for knowledge, i.e. one that links papers, datasets, and source code for both calculations and documents (LaTeX, etc.) in a graph of dependencies, citations, etc.

@alxlg I challenge the assertion that “Guix is overkill” for your research. What would that even mean? That it makes your computational workflow “too” reproducible⁈

Not the first time I read that. I tried to answer this question, namely “Is reproducibility practical?” 👇
hpc.guix.info/blog/2022/07/is-

Also, there may be other ways to implement that, but opaque container images are not one of them.

@adamhsparks @zimoun

hpc.guix.info: Guix-HPC — Is reproducibility practical?
Alex L 🕊 🇵🇸

@civodul @adamhsparks @zimoun

> To explore the behavior of the code, we need more. Guix eases exploration with “package transformation options”, which let users deploy variants of the software environment, for example by applying a patch somewhere in the software stack or swapping one dependency for another. A “frozen” application bundle such as a Docker image does not provide this lever.

Why? And what do you mean by "opaque" container image?

@alxlg A container image is a bunch of bytes, itself the result of a complex computational process: running ‘apt install’, building software, etc.

If all you have are those bytes, you cannot tell where they come from—just like when given a cake, you can at best guess what ingredients it contains, but you cannot tell whether it contains Novichok nor derive its recipe.

@zimoun @adamhsparks

@adamhsparks Likewise, with a Docker image, it’s hard or impossible to do more than just run the code that’s in there.

You’d like to run that code with a different version of one of its dependencies? Or built with a different flag? Or with a different implementation of its algorithm? You’re on your own: the container image doesn’t support that kind of experimentation; it’s “frozen”.
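For reference, those “package transformation options” are plain command-line flags; a sketch (package and patch names are illustrative):

```shell
# Swap one dependency for another across the whole software stack:
guix shell r --with-input=openblas=blis

# Apply a local patch somewhere in the stack:
guix shell r --with-patch=r-minimal=./fix-rng.patch
```

Each flag rewrites the dependency graph and rebuilds what is affected; a frozen image has no equivalent lever.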

@zimoun @alxlg

@civodul @zimoun @adamhsparks

And...? Why are you picking on an intermediate artifact of the whole OCI ecosystem? Images are shipped for convenience, but they are built using Dockerfiles. And if you don't like Dockerfiles, there are other ways that could better fit your use case. Here is also a demo taking advantage of Nix:

grahamc.com/blog/nix-and-layer

grahamc.com: Optimising Docker Layers for Better Caching with Nix - Graham Christensen

@alxlg That’s the point: for an image built with Nix or with ‘guix pack’, you can have good provenance tracking (Guix has ‘--save-provenance’ to record that info in the image).
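As a sketch of what that looks like (package names illustrative):

```shell
# Build a Docker-format image with Guix and embed provenance metadata
# (channel URLs and commits) describing exactly how it was built:
guix pack -f docker --save-provenance r r-ggplot2
```

With that metadata recorded, anyone can later re-run the build from the saved channels and check whether they obtain the same image.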

Conversely, an image built with a ‘Dockerfile’ is not reproducible, and thus not verifiable. It’s not reproducible because it depends on lots of external state such as Debian/Ubuntu/Alpine/PyPI servers.
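A typical Dockerfile makes this dependency on external state easy to see (contents illustrative):

```dockerfile
# Both lines below resolve against mutable external state, so
# rebuilding this image next year can yield different bytes:
FROM debian:latest                 # the 'latest' tag moves over time
RUN apt-get update && \
    apt-get install -y r-base      # resolves against today's Debian archive
```

Nothing in the file pins what 'latest' or 'r-base' actually were at build time.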

@adamhsparks @zimoun

@zimoun @adamhsparks @civodul

Again, exactly the same question I asked above. Why are you assuming the OCI image is the only thing that gets distributed? I never said that; images are meant for deployment, and they are distributed for convenience, just like you can distribute a single binary. Of course you need the source code and whatever is used to build it, and eventually to build the image or other artifacts meant to make it easier to *run* the code.