Zhang et al. 2022: "Corporate Dominance in Open Source Ecosystems: A Case Study of OpenStack" http://dx.doi.org/10.1145/3540250.3549117 "We find evidence of company domination in >73% of the repositories in OpenStack… We identify five patterns of corporate dominance: Early incubation, Full-time hosting, Growing domination, Occasional domination, and Last remaining. We find that domination has a significantly negative relationship with the survival probability of OSS projects." #nwit
@gvwilson Interesting, thanks!
I find the focus on "survival" a bit surprising (but it's relatively easy to observe). Once a project is dominated by a corporation, it becomes less and less useful to its other users, so in the end it doesn't matter so much whether it survives as a tool for that corporation or not.
@gvwilson I'd love to see a more general study of how OSS projects evolve as the composition of their communities changes over time.
What I have seen happening with the SciPy ecosystem is an early focus on growth for ensuring sustainability, which led to new participants having different objectives and thus a change of focus of the projects. Today, SciPy no longer serves the people for whom it was initially created: individual researchers and small teams.
@frankkrueger Not SciPy the library, but the ecosystem as a whole. It started with Numerical Python, first published in 1995.
Maybe the archived MatrixSIG mailing list has some explicit references to goals or objectives. I am drawing on my memories from those days (I was a founding member of the MatrixSIG).
I'll add that one strength in early Python was the ease of creating C extensions. Now that once-core capability is Somebody Else's Problem (a PEP 517 backend).
The shift of focus to isolated environments and reproducible builds might(?) be more appropriate for corporate concerns, but has been a downright nuisance for me.
For example, I've had to patch around changes that hinder my workflow, like how isolation hinders incremental rebuilds, while my full builds take >1 min.
@dalke That's a good illustration, and it also shows the influence of the wider Python ecosystem, which went down the same route towards professionalisation (trying to find a less loaded term than corporate domination).
Reminder: Python started out as a scripting language, in 1991. Today it has become unclear how many package managers it has or needs. I don't write small scripts in Python anymore, the overhead of environment management is too much.
@dalke @khinsen @gvwilson The sad thing is that Python doesn’t even do isolated environments well… It should be the default and only way to set up a project. Instead, there are more tools for it than I can count, and it’s a skill to even know how to choose one. Even relatively good tools like Conda don’t give you a lock file by default (not difficult to generate once you learn the command, but you have to know you need it and know to look for the command that does it).
Compare this to Julia, which has a single standard package and environment manager. The only thing to do is declare that the current directory will now contain a project, and the important things are automatically taken care of by the tool. People using the project downstream of you can then choose to use its dependencies in their newest available versions (tell the package manager to instantiate the Project.toml file), or the same pinned versions you used at the time you worked on the project (instantiate Manifest.toml, which is what Julia calls its lock file).
That insistence on isolated environments is exactly my point about how things have changed to make things harder for some.
Can you tell Julia's Project.toml that you depend on a third-party binary (like VMD at https://www.ks.uiuc.edu/Research/vmd/), and a manually built shared lib, when those aren't known to Julia?
I know some love lock files, but I've not used them as my research projects have essentially no dependencies. I set up a venv and use it for a couple of years.
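Concretely, that workflow needs nothing beyond the standard library (the environment name below is just an example):

```python
# A minimal sketch of the "one venv, reused for years" workflow,
# using only the standard library. "analysis-env" is an example name.
import venv

venv.create("analysis-env", with_pip=True)
# Afterwards: activate it (or call analysis-env/bin/python directly),
# pip-install the few things needed, and keep using it until a
# rebuild is forced.
```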
@dalke @khinsen @gvwilson Sorry, I got carried away talking about Julia. I don't know exactly how it deals with external dependencies because I haven't needed them (at least not directly). This is presumably explained here (and in the documentation linked therein): https://binarybuilder.org ; a quick glance suggests that you first need to write a package definition for the dependencies, so the simple answer to your question seems to be "no".
What I meant was that, in addition to making life more difficult for people who don't need isolated environments, the tools attempting to provide them in Python also don't achieve proper isolation smoothly/easily for people who do want it. So Python manages to be worse for both groups of users...
@dalke @khinsen @gvwilson That said, I would be surprised if everything went smoothly with your venv after a couple of years. Even if you never need any dependency, what happens when you want to share the project with someone else? Without an explicit description of the environment (a lock file), they will get whatever versions of the base components are in use today, which likely won't be the same as in your venv and may or may not run your program exactly as you intended, because everything changes all the time. Python is notorious for this. I have seen cases where installing the same Python program in a fresh conda environment just a few months apart resulted in some deprecation warnings being printed along with the results, because something changed in one of the dependencies (the program itself had not changed, same git tag was installed both times).
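Even a crude, hand-rolled version of such a description helps. A sketch using only the standard library (real lock files also record hashes and platform details; this only pins names and versions):

```python
# A crude, hand-rolled "lock file": record the exact version of every
# package installed in the current environment, one "name==version"
# per line, so a collaborator can later recreate or diff against it.
from importlib.metadata import distributions

with open("versions.lock", "w") as f:
    for dist in sorted(distributions(), key=lambda d: d.metadata["Name"].lower()):
        f.write(f"{dist.metadata['Name']}=={dist.version}\n")
```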
It isn't smooth, but *lock files don't solve the issue* because I have external dependencies. For example, I compile Open Babel from source, which needs CMake, SWIG, libxml2, eigen2, and Cairo; and I use the JPype bridge to access the CDK and zstd Java jars.
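For the curious, the JPype side looks roughly like this (the jar filename and the class used are illustrative, not my actual setup):

```python
# Roughly how a JPype bridge to the CDK works. The jar filename and
# the class used here are illustrative, not my actual setup.
import jpype
import jpype.imports

jpype.startJVM(classpath=["cdk-2.9.jar"])
from org.openscience.cdk import DefaultChemObjectBuilder
builder = DefaultChemObjectBuilder.getInstance()
```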
The point is that the "professionalization" needed for no-touch re-usable components is far higher than what is needed for most research code, which is often never shared, or only meant for a single, very specific case.
I'm reacting more to your 'It should be the ... only way to set up a project'.
VMD, a non-FOSS package, must be downloaded from the vendor.
With a non-isolated system, make an environment, install in it, and (re)use it for years.
With an isolated system, you must set up a local package server and figure out how to adapt the download to it.
Isolation requires more "professionalization" so will exclude some researchers. It may bring new ones, but is not an unalloyed good.
@dalke @Guillawme My "should" in this space is: Language-specific package and environment managers should not exist.
All programming tools should start from the assumption that real-life projects combine pieces written in different languages, and should provide support for interfacing and bridging, from the language level up to the environment management level.
Lack of consideration for this basic principle is one reason why Python ended up with so many package managers.
@dalke @Guillawme On the other hand, this moral obligation to think about integration applies to everyone offering software, proprietary vendors included. Downloading proprietary binaries should still be automatable, with stable references to precise versions.
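As a sketch of what I mean: a stable, versioned release URL plus a published checksum is already enough to automate. The URL and checksum below are made up for illustration:

```python
# A sketch of "automatable, with stable references to precise versions":
# fetch a pinned release URL and check it against a published checksum.
# The URL and checksum here are placeholders, made up for illustration.
import hashlib
import urllib.request

URL = "https://example.com/releases/tool-2.3.1.tar.gz"  # stable, versioned
SHA256 = "0123abcd..."  # placeholder for the vendor's published checksum

data = urllib.request.urlopen(URL).read()
if hashlib.sha256(data).hexdigest() != SHA256:
    raise RuntimeError("download does not match the published checksum")
with open("tool-2.3.1.tar.gz", "wb") as f:
    f.write(data)
```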
I know your idealism, but nothing like that exists. I doubt it ever will.
Consider https://hg.sr.ht/~dalke/rectfillcurve where I converted the C macros from doi:10.1109/BigData50022.2020.9378385 into Python iterators.
Consider that I turned the main() of https://www.ics.uci.edu/~eppstein/numth/frap.c into a function and fixed numerical issues.
I once used SWIG to bridge from C to Python. The hard part was making it OO and GC friendly.
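To show what "OO and GC friendly" means, here is the shape of the problem sketched with ctypes instead of SWIG (the library and its functions are hypothetical):

```python
# The "OO and GC friendly" problem, sketched with ctypes rather than
# SWIG: the C object must be freed exactly once, when Python's garbage
# collector drops the wrapper. libthing and its functions are hypothetical.
import ctypes

_lib = ctypes.CDLL("libthing.so")
_lib.thing_new.restype = ctypes.c_void_p
_lib.thing_free.argtypes = [ctypes.c_void_p]

class Thing:
    def __init__(self):
        self._handle = _lib.thing_new()

    def __del__(self):
        if self._handle:  # guard against double-free
            _lib.thing_free(self._handle)
            self._handle = None
```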
These adaptations will always require blood, sweat, tears, and tedium.
@dalke Yes, certainly, there will always be rough cases. I just want them to become much more exceptional than today. Most tasks for most people should be smooth, predictable, and comprehensible.
It's really hard for me to understand the basis for this moral obligation. Note that I am funded by commercial software development and consulting, while the two of you are academics.
Why am I morally obligated? How long does that obligation last? Can I ever be rid of that obligation? How much do I need to think about it? Should this morality be taught in school? What are the consequences of failure?
Are Microsoft and all projects on GitHub equally morally obligated?
@dalke The basis is being a good citizen of the realm of scientific research. That's what moral obligations are about.
The obligation comes from community consensus. On this specific topic, there is none yet, but I am convinced that it will be coming due to the increasing pressure towards Open Science practices.
Being moral rather than legal, the only penalty for non-conformance is being shunned by others.
@dalke BTW, this is a separate issue from commercial vs. academic. Software can be commercial and yet offer stable automated downloads. For example by putting a lock on the software and selling the key. Which is already what some software vendors do.
I also happen to be an advocate for Open Source in science, but that's another battle and a much more long-term one. Both the economics of research and the culture of some disciplines need to evolve, so it's decades rather than years.
I mean more that academic funding often carries with it the idea that public money should result in public code, which commercial software does not.
My experience is that academic researchers' views of "open science" are colored by how academics do research, which is different from how commercial researchers do science.
Furthermore, as a self-funded independent researcher, I have different constraints than an institutional researcher at a university or big company.
Konrad, you know I've been interested in Open Science practices almost as long as you have. I've tried for 15-20 years to "sell" commercial free software, and failed.
But Open Science doesn't answer concrete questions like: How long does the obligation last? How much money and time should I spend to meet it? How much work should one do to set up automatic downloads for an unknown audience which might not exist for 20 years, when I don't know their integration needs?
@dalke Your questions are interesting. Here is an attempt at some answers, from the point of view of a research software user.
> How long does the obligation last?
As long as you are willing and able to support your software. In commercial terms: as long as it makes sense for your business.
> How much work should one do to set up automatic downloads for an unknown audience [...] when I don't know their integration needs?
If you provide a stable URL to each release binary file or source archive over the lifetime of the project, you're already doing a good job.
The infrastructure behind it doesn't need to cost much: could be a home server in your garage, doesn't matter as long as the URLs are stable over the lifetime of your project.
Bonus points if you have a "sunsetting plan" that gives people enough notice to make arrangements once you decide to end maintenance of your software. How much time is enough depends on what the software does. How to sunset depends on how your business operates. If you no longer want to maintain it, that's presumably because it no longer makes business sense, so a reasonable approach would be to open-source the last version and let the community maintain it if they need it.
> How much money and time should I spend to meet it?
Enough to support: 1) providing stable URLs for all releases for the lifetime of the project, and 2) defining ahead of time a "sunsetting plan" that you will be able to implement if/when needed.
So if I don't want to support it beyond publication, I need not do anything! :)
FWIW, I have a business so I can research as I want, even when it makes no business sense.
I don't have stable source URLs as each licensee gets a secret URL, w/o password, which I change every few years since employees move.
My experience with "the community" is maintainers are usually non-existent, FOSS users don't tell me they exist, and package mangers discourage personal connections.
@Guillawme @dalke If you come up with a way to make the binaries public, and sell just a key, you can also dump the binary to Zenodo and let Zenodo handle all the long-term costs. It's the best example of Open Science support infrastructure we have today.
@dalke We don't have enough experience with Open Science to answer many concrete questions. It remains an ideal for now.
Example: how long do obligations last? My theoretical answer is: as long as scientists need to revisit published work in order to build on it. For me, in my discipline, right now, that would be around 20 years.
@dalke
More importantly, long-term support should be ensured by institutions and infrastructure, not individual effort. Assuming we have the right infrastructure (which nobody is working on so far), an individual's moral obligation would just be compatibility with that infrastructure. A one-time effort, which in the long run (well-adapted tools etc.) would be very small.
Lack of infrastructure is the real weak point of Open Science in my opinion.
The thing about these high-level ideals is that they don't at all guide the lower levels. Are we morally obligated to put the system in a Docker container? To put things in a public repo? To set up fully automated reproducible builds? To prefer a language (not Python) which is stable for 20+ years? To not use proprietary/paywalled software at all? FOSS-only, or is "for non-commercial use" okay?
Each has tradeoffs, full purity is hard, and few have the right skills.
@dalke I agree that the hard work is at the lower technical levels. But moral guidelines have to be higher-level to be useful. Being a good citizen of the scientific community means something different today than ten years ago.
Nice example from biology: improving photographs of Western blots for clarity used to be good practice a few decades ago. Today it's fraud. What has changed is the context: more data, easier manipulation.
@dalke That means that your detailed questions can only be answered for a narrow time frame and a specific discipline. It's indeed a question of tradeoffs.
Do "convivial tools" require being able to install and (re)run research software fully automatically?
You are right that Python is not the right language choice, but I've revisited published work from the 1970s written in Algol, and manually converted it in order to build on it. I think that's both moral and Open Science.
@Guillawme, while popular projects have many more users than devs, that's only the case when devs want to develop popular projects. Few researchers share that goal, with all its complexities.
@dalke For me, conviviality is a topic mostly separate from reproducibility. You want conviviality short-term, reproducibility long-term. Ideally you want both, but with today's state of the art that's difficult.
Revisiting published work from the 1970s is great! But probably you could do it only because of the smaller scale of the 1970s. Today's computational science involves more and more complex code and data.
Are we still using "reproducibility" for both the ability to re-run the code exactly (e.g. in a container) and to verify correctness (perhaps w/ manual steps)? That's always bugged me, in the free=gratis/libre way.
The computational research I do has few dependencies, but I know what you mean.
On the other hand, if you do need a bigger and deeper stack, isn't that saying 'individual researchers and small teams' simply need more professionalization than we did in the 1990s?
@dalke Jargon still hasn't fully stabilized, but the dominant usage is "reproducibility" for the former and "replicability" for the latter.
Software stacks have definitely grown, but I am not convinced that this is/was necessary. My theory is that professionalization was the cause, not the effect. "We need to program like the pros" is something I have heard more and more often over the years.