git for RDM and reproducibility
checklist
- software (a generic tool to do something)
  - use separate git repo for software
  - tag versions for reproducibility
  - keep software as generic as possible
- scripts (how to use software)
  - use separate git repo for scripts
  - tag versions for reproducibility
  - software is configured here
  - reference used software tag
- data management
  - publish dataset(s) to scientific data archive system
  - always attach proper metadata
  - get DOI for each version of the dataset(s) for reproducibility
  - reference used scripts tag
- publishing
  - use separate git repo for paper/thesis/...
  - tag versions for draft/review/final
  - convert text/source to (binary) products
  - reference used scripts tag
  - reference used data DOI
- platforms (GitLab, GitHub)
  - use platforms (GitLab, GitHub) for collaboration
  - review commits / merge requests
  - utilize project management tools
  - utilize automation for testing and publishing
intro
- a version control system (VCS) records changes (what, who, when, why)
- use platforms (GitLab, GitHub) for collaboration
git use cases
software
- keep software as generic as possible
- turn configuration/parameters into arguments, e.g.
myapp --seed=42
- this avoids having to rewrite software for parameter changes
- use software testing to verify software does what it's supposed to do
- tag versions to enable reproducibility
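The `myapp --seed=42` idea can be sketched in Python with the standard `argparse` module. This is only an illustration: the program name `myapp`, the `--samples` option, and the `simulate` function are hypothetical; only the pattern of exposing parameters as arguments comes from the notes above.

```python
import argparse
import random


def build_parser():
    # Hypothetical command-line interface for "myapp".
    parser = argparse.ArgumentParser(prog="myapp")
    parser.add_argument("--seed", type=int, default=0,
                        help="random seed, so that a run can be repeated")
    parser.add_argument("--samples", type=int, default=3,
                        help="how many random numbers to draw")
    return parser


def simulate(seed, samples):
    # A private generator makes the result depend only on the arguments.
    rng = random.Random(seed)
    return [rng.random() for _ in range(samples)]


def main(argv=None):
    # argv=None lets argparse read the real command line.
    args = build_parser().parse_args(argv)
    for value in simulate(args.seed, args.samples):
        print(value)
```

Calling `main(["--seed=42"])` corresponds to running `myapp --seed=42`; a real tool would call `main()` from an `if __name__ == "__main__":` guard. Because `simulate` depends only on its arguments, a test can check that the same seed yields the same output.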
scripting
- separate scripting from software
  - software: generic
  - scripting: software called with specific configuration/arguments
- scripting means how to run the software
  - i.e. here is where the parameters/arguments go
- think of it as digital lab notes
- this enables reproducibility
- specialized script variants for different environments, e.g.
  - laptop
  - RStudio / terminal server
  - HPC cluster
- think about execution scalability, i.e. changing parameters should not require rewriting software or scripts
- keep failed attempts in branches; the commit messages then record what you tried and why it didn't work
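A run script in the sense above might look like the following sketch. Everything named here is an assumption for illustration: the `myapp` executable and its options, the tag `v1.2.0`, the environment names, and the use of `srun` as a cluster launcher.

```python
import shlex

# Hypothetical tag of the software repo that this script was written against.
SOFTWARE_TAG = "v1.2.0"

# Parameters of the experiment: the "digital lab notes".
PARAMETERS = {"seed": 42, "samples": 1000}

# Hypothetical per-environment settings; only these differ between variants.
ENVIRONMENTS = {
    "laptop": {"launcher": [], "threads": 2},
    "hpc": {"launcher": ["srun"], "threads": 32},
}


def build_command(environment):
    """Return the argument list that runs myapp in the given environment."""
    settings = ENVIRONMENTS[environment]
    command = list(settings["launcher"]) + ["myapp"]
    command.append(f"--threads={settings['threads']}")
    # Sorted order keeps the command line stable between runs.
    for name, value in sorted(PARAMETERS.items()):
        command.append(f"--{name}={value}")
    return command


def describe_run(environment):
    """One line of lab notes: which software version ran with which arguments."""
    return f"myapp@{SOFTWARE_TAG}: {shlex.join(build_command(environment))}"
```

Passing `build_command("laptop")` to `subprocess.run(..., check=True)` would execute the run, provided `myapp` is installed. Committing and tagging this script records the parameters together with the software tag they were used with.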
publishing
- for paper, thesis, book, presentation, documentation, blog posts
- generate plots, flowcharts, etc. from code/scripts
- write text/paragraphs in a markup language (e.g. markdown)
- use automation workflows to
  - render plot/flowchart code to image files
  - convert text with pandoc to PDF/PS/HTML/EPUB
- use platforms for review process
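The conversion step could be automated with a small build script along these lines. The sketch assumes `pandoc` is installed (PDF output additionally needs a PDF engine such as LaTeX) and relies on pandoc choosing the output format from the extension of the file given with `-o`. The file name `paper.md` and the `build` directory are hypothetical.

```python
from pathlib import PurePosixPath


def pandoc_command(source, extension, out_dir="build"):
    """Build the pandoc call that converts source to the given format."""
    source = PurePosixPath(source)
    target = PurePosixPath(out_dir) / source.with_suffix("." + extension).name
    # --standalone asks for a complete document instead of a fragment;
    # the output format is inferred from the target's file extension.
    return ["pandoc", "--standalone", str(source), "-o", str(target)]


def build_all(source, extensions=("pdf", "html", "epub")):
    """Return one pandoc call per requested output format."""
    return [pandoc_command(source, ext) for ext in extensions]
```

Each returned list can be handed to `subprocess.run(command, check=True)`, for example from a CI job on the platform, so the binary products are regenerated from the tagged text rather than stored in the repo.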
integration of use cases for reproducibility
anti patterns
An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.
- most git anti-patterns are about how to use git
- the focus here is on those relating to RDM and reproducibility
binary files
- git as a VCS works well only for text files
  - markdown
  - source code, scripts
  - (small) CSV
- binary files can't be diffed meaningfully, e.g.
  - compiled programs
  - MS Word, Excel
  - PDF, PS
  - JPEG, PNG
- use a textual representation instead, e.g.
  - Graphviz DOT for flowcharts
  - R ggplot and CSV for plots
- use automation to convert the textual representation to e.g. images
- use .gitignore so that binary products are never added to the repo
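A sketch of how ignore patterns keep binary products out of the repo. The pattern list is an example, not a complete `.gitignore`, and Python's `fnmatchcase` only approximates git's matching rules for simple `*.ext` patterns like these.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Example patterns for generated binary products (an assumption, adapt as needed).
IGNORE_PATTERNS = ["*.pdf", "*.ps", "*.png", "*.jpg", "*.docx", "*.xlsx"]


def gitignore_text(patterns=IGNORE_PATTERNS):
    """Render the patterns as the content of a .gitignore file."""
    return "\n".join(patterns) + "\n"


def is_ignored(path, patterns=IGNORE_PATTERNS):
    """Approximate check: does the file name match one of the simple patterns?"""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)
```

With these patterns, `is_ignored("figures/plot.png")` is true while the plot script and a small CSV next to it stay tracked. In an actual repo, `git check-ignore -v <path>` reports which rule, if any, applies to a file.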
scientific data in git repos
- data is often binary
- a git repo should stay small; data bloats it, even when the data is text
- data has different release cycles than code
- even Git LFS (Large File Storage) does not fix this: code and data still end up in one big ball of mud
- scientific datasets need metadata!
- use proper archive system for data
platforms
- enable collaboration
- bug tracker / feature requests
- documentation / wiki
- project management tools
  - issue boards, milestones, Gantt charts
- trigger automation
- publish/download releases
- go to https://git.idiv.de, log in, and create new projects!