Contributing to PyDistintoX
Thank you for your interest in contributing to PyDistintoX. Whether you're here to report a bug, suggest a feature, or submit code, your help is greatly appreciated.
How to Contribute
Reporting Bugs
- Use GitLab Issues to report bugs.
- Describe the issue in as much detail as possible.
Pull Requests
- Fork the repository and create a new branch for your changes.
- Make sure all tests (see below) pass successfully.
- Keep your commit messages clear and concise.
Code Guidelines
Imports
Imports are ordered as follows:
- standard library imports, e.g. logging
- third-party imports, e.g. numpy
- application-specific imports, e.g. common.utils
  - apart from typing, always import common.utils as common_utils, so that intra-package imports such as .utils can be told apart from imports from common.utils
- intra-package imports (relative imports), e.g. .utils
Styling
Use the Black formatter for Python and Prettier for everything else.
Import Headings
Use these headings for imports:
- standard library imports
- third-party imports
- tests-specific imports
- imports from source code
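To make the ordering and headings above concrete, here is a minimal sketch of the import header of a hypothetical test module. The header is kept in a string so the snippet runs without the project installed. The exact import paths (`pydistintox.common`, and `tests.config` aliased as `tests_config`) are assumptions based on the code tree, and `imported_modules` is an illustrative helper, not part of PyDistintoX.

```python
import ast

# Hypothetical header of a test module; the import paths are assumptions.
EXAMPLE_HEADER = '''
# standard library imports
import logging

# third-party imports
import numpy as np

# tests-specific imports
from tests import config as tests_config

# imports from source code
from pydistintox.common import utils as common_utils
'''


def imported_modules(source: str) -> list:
    """Return the imported module names in the order they appear."""
    names = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports are shown with their leading dots.
            names.append("." * node.level + (node.module or ""))
    return names


print(imported_modules(EXAMPLE_HEADER))
```

The printed list follows the heading order: standard library first, then third-party, then tests-specific, then source code.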
Dependencies
To update requirements.txt after changing dependencies in pyproject.toml, run:
uv export --no-hashes --no-dev --no-header --no-emit-project -o requirements.txt
requirements.txt is automatically checked in the CI pipeline on every merge to main. The pipeline will fail if it is not up to date.
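To check locally whether requirements.txt would pass, something like the following sketch may help. It reuses the flags of the export command above, writes the export to a temporary file, and compares it with the committed file. How scripts/check_requirements.sh actually performs the check is not shown here, so `requirements_up_to_date` is a hypothetical helper and only an approximation of the CI behaviour.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Sequence

# Flags copied from the export command above; "-o <file>" is appended below.
EXPORT_COMMAND = (
    "uv", "export", "--no-hashes", "--no-dev", "--no-header", "--no-emit-project",
)


def requirements_up_to_date(
    committed: Path, command: Sequence[str] = EXPORT_COMMAND
) -> bool:
    """Export into a temporary file and compare it with the committed file."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        fresh = Path(tmp_dir) / "requirements.txt"
        subprocess.run(
            [*command, "-o", str(fresh)],
            check=True,                 # raise if the export itself fails
            stdout=subprocess.DEVNULL,  # keep the terminal quiet
        )
        return fresh.read_text() == committed.read_text()
```

Run from the repository root, `requirements_up_to_date(Path("requirements.txt"))` returns False when the file needs to be regenerated.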
uv Version
The CI pipeline uses uv 0.11.15. To keep your local environment in sync, update uv with:
uv self update
Tests
To run the tests, enter:
uv run pytest tests/TEST_FOLDER
The test results are saved in ./tests/artifacts.
Add -v for verbose output, or -s to show print statements even when a test does not fail.
Deselect slow tests with pytest -m "not slow" (recommended!).
At the moment, the following kinds of tests are available (substitute one for TEST_FOLDER above):
- benchmark-tests
  - very slow
  - not well implemented yet
- unit-tests
  - very fast
  - check the functionality of single functions
- integration-tests
  - slow
  - run the application end to end with the example data; check that the mathematics in distinct_measures works as expected
- script-tests
  - shell-based tests for the CLI and example scripts
Styling
- Tests need unique names, so test names follow this convention:
test_[function_under_test]_[test_kind]_[test_scenario (optionally)]
for example: test_get_measure_names_unit_specific_measures
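The naming convention can be read as joining up to three parts with underscores. The sketch below spells that out; `build_test_name` is a hypothetical helper used only for illustration and does not exist in the project.

```python
from typing import Optional


def build_test_name(
    function_under_test: str, test_kind: str, scenario: Optional[str] = None
) -> str:
    """Compose a test name as test_[function]_[kind]_[scenario]."""
    parts = ["test", function_under_test, test_kind]
    if scenario:  # the scenario part is optional
        parts.append(scenario)
    return "_".join(parts)


print(build_test_name("get_measure_names", "unit", "specific_measures"))
# → test_get_measure_names_unit_specific_measures
```

Without a scenario, the same call yields `test_get_measure_names_unit`.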
Command Line Options
--benchmark-save=NAME \
--benchmark-storage=./tests/artifacts/performance/ \
--benchmark-histogram=tests/artifacts/performance/NAME \
--benchmark-compare=TEST_NUMBER_TO_COMPARE \
-m "not slow" \
-k "not example" # ignore `slow` and also `example` tests
Tips and Tricks
- If uv does not find pytest, delete your .venv and try again.
- Module names need to be unique.
- To save the log, use e.g.
TEST-COMMAND > ./tests/artifacts/tmp_data/$(date -Iminutes).log 2>&1
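The log-saving tip above can also be reproduced from Python, for example when a test run is started from a script. This is a minimal sketch under stated assumptions: `run_and_log` is a hypothetical helper, and the timestamp mimics `date -Iminutes` by using an ISO 8601 string with minute precision and UTC offset.

```python
import subprocess
from datetime import datetime
from pathlib import Path
from typing import Sequence


def run_and_log(command: Sequence[str], log_dir: Path) -> Path:
    """Run a command and write stdout and stderr to a timestamped log file."""
    log_dir.mkdir(parents=True, exist_ok=True)
    # Timestamp with minute precision and UTC offset, similar to `date -Iminutes`.
    stamp = datetime.now().astimezone().isoformat(timespec="minutes")
    log_path = log_dir / f"{stamp}.log"
    with log_path.open("w") as log_file:
        # stderr is merged into stdout, like `2>&1` in the shell version.
        subprocess.run(
            list(command), stdout=log_file, stderr=subprocess.STDOUT, check=False
        )
    return log_path
```

A call such as `run_and_log(["uv", "run", "pytest", "tests/unit-tests"], Path("tests/artifacts/tmp_data"))` would mirror the shell command. The timestamp contains colons, which some file systems (notably on Windows) do not accept in file names.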
Overview of Code
Tree of Code
(last updated on 2026-05-29)
.
├── CONTRIBUTING.md
├── data # contains all data produced by the application
│ ├── example_texts # example texts by Arthur Conan Doyle, used for demonstration purposes
│ │ ├── ref
│ │ │ ├── acd004.txt
│ │ │ ├── ...
│ │ │ ...
│ │ └── tar
│ │ ├── acd001.txt
│ │ ├── ...
│ │ ...
│ ├── interim # contains intermediate data not intended for end users
│ │ └── json # NLP produces JSON data, which is saved here (via --save-nlp)
│ │ ├── ref
│ │ └── tar
│ └── results
│ ├── correlations_of_measures
│ │ └── heatmap.html # visualization of the correlation between the results of different measures
│ └── scores # contains the results of the data analysis
│ ├── afn.csv # Contains the results of a measure using 'afn' as smartirs parameter
│ ├── afn.html # HTML visualization of the above result (open in a browser)
│ ├── bfn.csv
│ ├── bfn.html
│ ├── ...
│ ...
├── LICENSE
├── pyproject.toml # contains all formal data that pip install and uv run need to know
├── pytest.ini # config file for testing
├── README.md
├── requirements.txt # explicit dependencies to read
├── scripts
│ ├── check_requirements.sh # checks that requirements.txt is up to date
│ └── check_version_bump.sh # checks that the version has been bumped
├── src # the source code is saved here
│ └── pydistintox # executing package
│ ├── __init__.py
│ ├── __main__.py
│ ├── cli.py # contains all command-line options
│ ├── common # package that contains all code that is used by multiple other packages
│ │ ├── __init__.py
│ │ ├── config.py # contains static objects used in several packages, such as paths and classes
│ │ └── utils.py
│ ├── distinct_measures # this package contains all calculations apart from the nlp done by gensim
│ │ ├── __init__.py
│ │ ├── config.py
│ │ ├── core.py # mainly these functions are exported
│ │ ├── load_matrices.py # not used anymore but may be useful in future
│ │ ├── measures.py # calculates 6 different statistical measures
│ │ └── utils.py
│ ├── main.py # this file is executed when running the application
│ ├── td_matrices # this package does the nlp, gives results of 6 measures (all done by gensim) and calculates term-document-matrices
│ │ ├── __init__.py
│ │ ├── config.py
│ │ ├── core.py
│ │ ├── create_matrices.py
│ │ ├── io_utils.py # utilities for saving/loading NLP results as JSON
│ │ ├── parsing.py # functions for parsing texts; spacy is used here
│ │ ├── tfidf_measures.py # execution of the tf-idf measures
│ │ └── utils.py
│ └── visualize
│ ├── __init__.py
│ ├── config.py
│ ├── core.py # visualization of correlation of measures and visualization of scores
│ └── utils.py
├── tests # using pytest package; pympler to measure used RAM
│ ├── artifacts # data of the tests
│ │ ├── memory # data of memory-tests
│ │ ├── performance # data of performance tests
│ │ ├── sanity_data
│ │ ├── test_data
│ │ └── tmp_data # used, for example, to test the creation of HTML files
│ ├── benchmark-tests # not well implemented yet
│ ├── config.py
│ ├── conftest.py # fixtures used in several different kinds of tests
│ ├── integration-tests # runs the application from start to finish
│ │ ├── conftest.py
│ │ ├── distinct_measures
│ │ └── td_matrices
│ ├── script-tests # shell-based tests for the CLI and the self-parsed workflow
│ │ ├── test_cli.sh
│ │ └── test_self_parsed.sh
│ └── unit-tests # most functions have their own unit tests; used frequently
└── uv.lock