1. Nature really does value getting studies correct, and is thus probably even more stringent about retractions: they will retract studies for smaller problems than other journals would. So there's a high bar for entry, and also a strong focus on vetting any challenge that could lead to a retraction.
2. There are anti-climate-change groups that challenge everything, so a random link to someone challenging this doesn't immediately make me doubt it. The trust in Nature is high enough that the challenger must be pretty good, not just someone saying "well, this model looks too simplistic to me, thus it is false".
"Should be acceptable" is the wrong way to look at it. "Should be illegal" is the question that should be answered.
There's a subtle but important distinction between
> Alcohol is allowed, so cannabis should be allowed.
and
> Alcohol is not illegal, so cannabis should not be illegal.
Our legal system is a default-allow denylist: unless a law says you can't do something, you can. The government doesn't grant permission. It removes it.
So because the government hasn't made the case that we should ban alcohol, I think it's on them to prove that cannabis is somehow worse to justify its banning.
But I'd add getting rid of heatmaps on large datasets. They are information dense and pretty, but I can't see how anyone interprets them.
Better to do clustering and plot the data for each relevant cluster in a more meaningful way.
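Roughly what I mean, as a sketch; the random matrix, KMeans, and k=4 are placeholders, not a recommendation for any particular dataset:

```python
# Sketch: cluster the rows, then plot a summary per cluster instead of one huge heatmap.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 20))        # stand-in for a samples-by-features matrix

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(data)

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True)
for k, ax in enumerate(axes.flat):
    cluster = data[labels == k]
    mean, std = cluster.mean(axis=0), cluster.std(axis=0)
    ax.plot(mean)
    ax.fill_between(np.arange(data.shape[1]), mean - std, mean + std, alpha=0.3)
    ax.set_title(f"cluster {k} (n={len(cluster)})")
plt.tight_layout()
plt.show()
```

You lose the "everything at once" view, but each panel is actually readable.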
I suppose there are already many articles showing how to speed calculations by avoiding/optimizing pandas.
It does feel like a bit of an unfair comparison. Everyone knows that for loops are slow in Python, as is much of the core library. But pushing the work down to C through Pythonic APIs (numpy/numba/pytorch) is fairly trivial.
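A toy example of what I mean by pushing the loop into C (names and sizes made up; the exact speedup depends on the machine):

```python
# Toy comparison: the same reduction as a plain Python loop vs. vectorized numpy.
import timeit
import numpy as np

x = np.random.rand(1_000_000)

def mean_square_loop(arr):
    # every element passes through the interpreter
    total = 0.0
    for v in arr:
        total += v * v
    return total / len(arr)

def mean_square_numpy(arr):
    # the same loop, but it runs in C inside numpy
    return float(np.mean(arr * arr))

print("loop :", timeit.timeit(lambda: mean_square_loop(x), number=3))
print("numpy:", timeit.timeit(lambda: mean_square_numpy(x), number=3))
```

The numpy version is usually one to two orders of magnitude faster for element-wise work like this.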
So my question to the non-bioinformatics folks: is this already a solved problem?
You have tasks whose resource requirements depend on the input parameters, they are run in Docker containers to pin down the environment, and you want to track the output of each step.
Often these are embarrassingly parallel operations (e.g. I have 200 samples to do the same thing on).
Something like Dask perhaps, but where you can specify a Docker image for the task? (Roughly the pattern sketched below.)
What is the go-to in DevOps for similar tasks? GitHub Actions comes pretty close...
To the bioinformatics folks: what is the unique selling point of Nextflow over, say, WDL/Cromwell?
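For reference, the embarrassingly parallel part is trivial in Dask; it's the per-task Docker image that I don't think it gives you out of the box (containerization normally happens at the worker level). A minimal sketch, with `process_sample` standing in for the real tool:

```python
# Minimal "200 samples, same task" pattern with dask.distributed.
from dask.distributed import Client

def process_sample(sample_id):
    # pretend this shells out to the actual analysis
    return f"{sample_id}.result"

if __name__ == "__main__":
    client = Client()                 # local cluster; point at a real scheduler in practice
    samples = [f"sample_{i:03d}" for i in range(200)]
    futures = client.map(process_sample, samples)
    results = client.gather(futures)
    print(f"{len(results)} samples processed")
```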
I do computational physics and I use Snakemake.
On HPCs we only have user-level access. We are not allowed to run long-running processes on login nodes; they can be killed at any time for violating the rules.
That means anything that depends on Docker is a no-go, and we try to avoid anything with a client-server structure (although it might be possible to host a daemon elsewhere, paid for out of our own pockets as students).
We also deal with a lot of tools that are not well written, and not in Python or other modern languages; you wouldn't want to build CFFI bindings onto them.
So Snakemake, and similarly Nextflow, suits our needs well. It is a user-space CLI tool that does not require any privileges, and it is optimized for running bash commands / any CLI-based tools (minimal example below). A bonus for Snakemake is that it uses Python, and our other scripts use Python too.
So I guess DevOps tooling, which is heavily biased towards Docker or other container-based execution, really is a different space.
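For anyone curious what that looks like in practice, here is a minimal Snakefile sketch; the tool name and its flags are made up, but the structure (declared inputs/outputs plus a shell command) is the whole point. Snakefiles are essentially Python with some rule syntax on top:

```
rule all:
    input:
        expand("results/{sample}.stats.txt", sample=["A", "B", "C"])

# hypothetical CLI tool; Snakemake fills in {input}/{output}/{threads}
rule compute_stats:
    input:
        "data/{sample}.dat"
    output:
        "results/{sample}.stats.txt"
    threads: 4
    shell:
        "some_cli_tool --threads {threads} --in {input} --out {output}"
```

Snakemake builds the DAG from the filename patterns and only reruns what is out of date, all from user space.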
The big difference when comparing bioinformatics systems with non-bioinformatics ones is what the typical payload of a DAG node is, and what optimizations that implies. Most other domains don't have DAG nodes that assume the payload is a crappy command-line call and expect inputs/outputs to magically be in specific places on a POSIX file system.
You can do this on other systems but it’s nice to have the headache abstracted away for you.
The other major difference is the assumed lifecycle. In most business domains you don't have researchers iterating on these things the way you do in bioinformatics. The newer ML/DS systems do solve this problem better than, say, Airflow.
I for one have started to appreciate the fact that the shell/commandline interface means:
- We have an interface that very strongly imposes composability, something rarely seen in other parts of IT, and it actually makes people "follow the rules" :D
- Data is (mostly) treated as immutable, except perhaps inside tools
- Data is cached
- The CLI boundaries mean that one can at least inspect inputs/outputs as a way to debug.
- Etc...
Personally, the biggest frustration is all the inconsistencies in how people design the command-line interfaces. Primarily that output filenames are so often created based on non-obvious and sometimes arbitrary rules, rather than being specified by the user. If all filenames were specified (or at least possible to specify) via the CLI, pipeline managers would have an enormously easier time (sketch of what I mean below).
What happens now is that you basically need a mechanism like Nextflow has, where all commands are executed in a temp directory, and the pipeline tool just globs up all the generated files afterwards. This works, but opens up a lot of possibilities for mistakes in how files are tracked (a file might be routed to the wrong downstream output if you do something funny with the naming, such that two output path patterns overlap).
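To make the wish concrete, this is the kind of interface I mean: every output path is an explicit flag, so the pipeline manager knows exactly what will be written where. The tool and its flags are hypothetical:

```python
# Hypothetical tool where every output path is an explicit argument,
# so a pipeline manager never has to guess or glob for the results.
import argparse

parser = argparse.ArgumentParser(description="hypothetical analysis step")
parser.add_argument("--input", required=True, help="input file")
parser.add_argument("--output-table", required=True, help="where to write the results table")
parser.add_argument("--output-log", required=True, help="where to write the run log")
args = parser.parse_args()

# ... do the actual work ...
with open(args.output_table, "w") as table:
    table.write("col1\tcol2\n")          # placeholder for real results
with open(args.output_log, "w") as log:
    log.write(f"read {args.input}, wrote {args.output_table}\n")
```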
Nextflow can't even get this right: base Nextflow uses some combination of `--paramName` and `--param-name` and treats them as interchangeable, while nf-core encourages `--param_name` (which Nextflow sees as different). All trivial differences, but it just adds another layer to the CLI frustration train.
>2017/01/31 23:00-ish
> YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com
>2017/01/31 23:27 YP - terminates the removal, but it’s too late. Of around 310 GB only about 4.5 GB is left
> YP says it’s best for him not to run anything with sudo any more today, handing off the restoring to JN.
Then in the post-mortem about lack of backups:
> LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
> Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
I have had (and inevitably will have again) bad days like poor YP's. All I can count on is maintaining good habits, like making backups before doing production work, as YP did.
> like making backups before undergoing production work
The specific part you mention also brings up a really vital part of a backup system: testing that the backups you generate can actually be restored.
I've seen so many companies with untested recovery procedures where most of the time they just state something like "Of course the built-in backup mechanism works; if it didn't, it wouldn't be much of a backup, would it? Haha" while never actually trying to recover from it.
Although, to be fair, out of the tens of untested setups I've seen, only once did it actually have an impact and the backups really didn't work, but the morale hit that company took made my brain really remember to test your backups.
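The cheapest version of "test your backups" I know of is a scheduled job that restores the latest dump into a scratch database and runs a sanity query. A rough sketch for Postgres; the dump path, scratch DB name and the `users` table are all made up:

```python
# Hedged sketch of a "can we actually restore this?" check for a Postgres dump.
import subprocess

DUMP = "/backups/latest.dump"
SCRATCH_DB = "restore_test"

subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
subprocess.run(["createdb", SCRATCH_DB], check=True)
subprocess.run(["pg_restore", "--no-owner", "-d", SCRATCH_DB, DUMP], check=True)

# a restore that produces an empty database should fail loudly
out = subprocess.run(
    ["psql", "-At", "-d", SCRATCH_DB, "-c", "SELECT count(*) FROM users;"],
    check=True, capture_output=True, text=True,
)
assert int(out.stdout.strip()) > 0, "restored database looks empty -- backups may be broken"
```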
Indeed, the feeling of dread when you do something that causes prod to go down is bad enough. I can't even imagine the feeling when accidentally deleting prod data...
It's just the letters A, G, T, and C over and over again, right? Surely we can squeeze some better compression out of that, dictionary coding or something?
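Naive 2-bit packing alone already gives you 4x over one ASCII byte per base; a rough sketch (it ignores everything that makes real sequence data messy, like N bases, quality scores and headers):

```python
# Naive 2-bits-per-base packing: 4x smaller than one ASCII byte per base.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = {v: k for k, v in CODE.items()}

def pack(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))     # left-align a partial final chunk
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    bases = []
    for byte in data:
        bases.extend(BASE[(byte >> shift) & 0b11] for shift in (6, 4, 2, 0))
    return "".join(bases[:length])

seq = "GATTACAGATTACA"
assert unpack(pack(seq), len(seq)) == seq
```

Dedicated tools can do better still by exploiting repeats, but even this shows how much slack there is in the plain-text representation.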
I would assume most datasets people are using aren't large enough to justify a database versus parsing a flat file. You also have to realize this is a field of scientific code, where it matters more that you spend less time coding and more time interpreting results than that you optimize the pipeline to minimize compute time, and you might be working on a university cluster where your compute is powerful and quite cheap. People who work on clinical pipelines that will be rerun time and again over an ever-growing amount of patient data probably already put their data into databases. For your academic postdoc working on 2000 samples from an experiment for one paper, before they find another job doing something else entirely in two years, a flat file is fine.
I was thinking, for example, of VCF files. A metadata header, a main table with eight clear columns and a ninth column that works as a "put here whatever you need", and then the related data for each sample in extra columns.
The next thing you know, you have a set of tools recreating a small subset of SQL: to index the file, to add records in bulk, to edit the metadata...
The typical VCF has enough data in it to be an SQLite database (toy sketch below), and nobody parses the VCF directly anyway, only through tools.
This ends with a sad number of bio-scientists who cannot do the simplest SQL query but know vcftools, samtools, bedtools and the rest perfectly (or have them hardcoded in shell scripts). Those formats start out so simple you can "parse" them with grep, cut, wc and paste, but soon they need special tooling and accumulate feature creep.
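To illustrate how directly the fixed columns map onto a table, a toy sketch; the file name and data are made up, and it ignores the header and per-sample columns entirely:

```python
# Toy example: the eight fixed VCF columns map straight onto a table.
import sqlite3

conn = sqlite3.connect("variants.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS variants (
        chrom TEXT, pos INTEGER, id TEXT, ref TEXT,
        alt TEXT, qual REAL, filter TEXT, info TEXT
    )
""")

with open("example.vcf") as vcf:
    rows = (line.rstrip("\n").split("\t")[:8]
            for line in vcf if not line.startswith("#"))
    conn.executemany("INSERT INTO variants VALUES (?,?,?,?,?,?,?,?)", rows)
conn.commit()

# the kind of question that otherwise means grep/awk or yet another dedicated tool
for chrom, n in conn.execute("SELECT chrom, COUNT(*) FROM variants GROUP BY chrom"):
    print(chrom, n)
```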
And it's much easier to teach a grad student who already knows some basic bash, R, or Python how to read a flat file into a data frame and make some plots, or grep a few lines, than to deal with databases. While bioinformatics tooling is outdated in many ways, modern software engineering could do with more config.txt and fewer hidden SQLite databases holding settings only accessible through an Electron GUI.