Contributing to PyDistintoX

Thank you for your interest in contributing to PyDistintoX. Whether you're here to report a bug, suggest a feature, or submit code, your help is greatly appreciated.

How to Contribute

Reporting Bugs

  • Use GitLab Issues to report bugs.
  • Describe the issue in as much detail as possible.

Pull Requests

  • Fork the repository and create a new branch for your changes.
  • Make sure all tests (see below) pass successfully.
  • Keep your commit messages clear and concise.

Code Guidelines

Imports

Imports are ordered in the following groups:

  • standard library imports, e.g. logging
  • third-party imports, e.g. numpy
  • application-specific imports, e.g. common.utils. Apart from typing, always import common.utils as common_utils, so that it is clearly distinguishable from the intra-package .utils module.
  • intra-package imports, i.e. relative imports such as .utils
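
The grouping above can be illustrated with a small checker. This is a minimal sketch and not part of PyDistintoX: the names `import_group` and `imports_are_ordered`, the sample source, and the use of the top-level package name `pydistintox` to recognise application-specific imports are all assumptions made for the example.

```python
import ast
import sys

# Group indices follow the order described above.
STDLIB, THIRD_PARTY, APPLICATION, INTRA_PACKAGE = range(4)

# Assumption: application-specific imports start with this top-level package.
APP_PACKAGE = "pydistintox"

# sys.stdlib_module_names exists from Python 3.10 on; the fallback is a tiny
# hand-picked set that only covers this example.
STDLIB_NAMES = getattr(sys, "stdlib_module_names", frozenset({"logging", "typing"}))


def import_group(node):
    """Return the group index of a single import statement."""
    if isinstance(node, ast.ImportFrom) and node.level > 0:
        return INTRA_PACKAGE  # relative import such as `from . import utils`
    name = node.module if isinstance(node, ast.ImportFrom) else node.names[0].name
    top_level = name.split(".")[0]
    if top_level == APP_PACKAGE:
        return APPLICATION
    if top_level in STDLIB_NAMES:
        return STDLIB
    return THIRD_PARTY


def imports_are_ordered(source):
    """Check that the top-level imports in `source` appear group by group."""
    groups = [
        import_group(node)
        for node in ast.parse(source).body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    return groups == sorted(groups)


# Hypothetical module header; it is only parsed, never executed.
EXAMPLE = '''
import logging

import numpy as np

import pydistintox.common.utils as common_utils

from . import utils
'''

print(imports_are_ordered(EXAMPLE))
```

Because the source is only parsed with `ast`, the example runs without numpy or the package itself being installed.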

Styling

Python code is formatted with Black; everything else is formatted with Prettier.

Import Headings

Use these headings for imports:

  • standard library imports
  • third-party imports
  • tests-specific imports
  • imports from source code

Dependencies

To update requirements.txt after changing dependencies in pyproject.toml, run:

uv export --no-hashes --no-dev --no-header --no-emit-project -o requirements.txt

requirements.txt is automatically checked in the CI pipeline on every merge to main. The pipeline will fail if it is not up to date.
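
The same check can be approximated locally before pushing. The following is a minimal sketch and not the content of scripts/check_requirements.sh, which may work differently: it re-runs the export command from above without `-o`, so that uv prints to standard output, and compares the result with the file on disk. The function names are hypothetical, and ignoring blank lines and trailing whitespace is an assumption of this sketch.

```python
import subprocess
from pathlib import Path

# Same flags as the export command above, minus `-o`, so that uv writes
# the exported requirements to standard output.
EXPORT_COMMAND = [
    "uv", "export", "--no-hashes", "--no-dev", "--no-header", "--no-emit-project",
]


def normalise(text):
    """Drop blank lines and trailing whitespace so formatting noise is ignored."""
    return [line.rstrip() for line in text.splitlines() if line.strip()]


def requirements_match(exported, on_disk):
    """Return True if both requirement listings have the same content."""
    return normalise(exported) == normalise(on_disk)


def check_requirements(path="requirements.txt"):
    """Run the export and compare it with the file at `path`.

    Needs uv on the PATH and must be called from the repository root.
    """
    result = subprocess.run(EXPORT_COMMAND, capture_output=True, text=True, check=True)
    return requirements_match(result.stdout, Path(path).read_text(encoding="utf-8"))
```

Calling `check_requirements()` from the repository root returns `False` when requirements.txt needs to be regenerated.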

uv Version

The CI pipeline uses uv 0.11.15. To keep your local environment in sync, update uv with:

uv self update

Tests

To run the tests, enter:

uv run pytest tests/TEST_FOLDER

The test results are saved in ./tests/artifacts. Add -v for verbose output, or -s to show print statements even when a test does not fail. Deselect slow tests with pytest -m "not slow" (recommended).

Currently, the following kinds of tests are available (substitute one of them for TEST_FOLDER above):

  • benchmark-tests
      • very slow
      • not well implemented yet
  • unit-tests
      • very fast
      • check the functionality of single functions
  • integration-tests
      • slow
      • runs the application end to end with the example data; checks that the mathematics in distinct_measures works as expected
  • script-tests
      • shell-based tests for the CLI and example scripts

Styling

  • Tests need unique names, so test names follow this convention:
test_[function_under_test]_[test_kind]_[test_scenario (optional)]

for example: test_get_measure_names_unit_specific_measures
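
The convention above can be expressed as a regular expression. This is a minimal sketch under assumptions: the set of test kinds (`unit`, `integration`, `benchmark`) is inferred from the test folders and is not confirmed by the project, and `parse_test_name` is a hypothetical helper.

```python
import re

# Assumption: the test kinds mirror the Python test folders listed above.
TEST_KINDS = ("unit", "integration", "benchmark")

TEST_NAME = re.compile(
    r"^test_(?P<function>[a-z0-9_]+?)"           # function under test (shortest match)
    r"_(?P<kind>" + "|".join(TEST_KINDS) + r")"  # test kind
    r"(?:_(?P<scenario>[a-z0-9_]+))?$"           # optional test scenario
)


def parse_test_name(name):
    """Split a test name into its parts, or return None if it does not match."""
    match = TEST_NAME.match(name)
    return match.groupdict() if match else None


print(parse_test_name("test_get_measure_names_unit_specific_measures"))
```

The split is ambiguous when the name of the function under test itself contains one of the kinds; the pattern then treats the first occurrence as the test kind.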

Command Line Options

uv run pytest tests/TEST_FOLDER \
      --benchmark-save=NAME \
      --benchmark-storage=./tests/artifacts/performance/ \
      --benchmark-histogram=tests/artifacts/performance/NAME \
      --benchmark-compare=TEST_NUMBER_TO_COMPARE \
      -m "not slow" \
      -k "not example"  # skip tests marked `slow` and also tests matching `example`
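
When these options are combined often, it can help to assemble the command programmatically. The following is a minimal sketch: `build_pytest_command` and its parameters are hypothetical, and only the flags listed above are used.

```python
import shlex


def build_pytest_command(test_folder, benchmark_name=None, compare_to=None,
                         skip_slow=True, skip_keyword=None):
    """Assemble the argument list for a `uv run pytest` call."""
    command = ["uv", "run", "pytest", f"tests/{test_folder}"]
    if benchmark_name is not None:
        storage = "tests/artifacts/performance"
        command += [
            f"--benchmark-save={benchmark_name}",
            f"--benchmark-storage=./{storage}/",
            f"--benchmark-histogram={storage}/{benchmark_name}",
        ]
    if compare_to is not None:
        command.append(f"--benchmark-compare={compare_to}")
    if skip_slow:
        command += ["-m", "not slow"]  # deselect tests marked as slow
    if skip_keyword is not None:
        command += ["-k", f"not {skip_keyword}"]  # deselect tests matching the keyword
    return command


command = build_pytest_command("unit-tests", skip_keyword="example")
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` to execute it; `shlex.join` is only used here to show the command as it would be typed in a shell.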

Tips and Tricks

  • If uv does not find pytest, delete your .venv and try again.
  • Module names need to be unique.
  • To save the log, redirect the output, e.g. TEST-COMMAND > ./tests/artifacts/tmp_data/$(date -Iminutes).log 2>&1
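
The log redirect from the last tip can also be done from Python, for example inside a helper script. This is a minimal sketch; `log_path` and `run_and_log` are hypothetical names and not part of the project.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def log_path(directory, now=None):
    """Build a log file path named after the local time, to the minute."""
    now = now or datetime.now().astimezone()
    # With a timezone-aware time this has the same shape as `date -Iminutes`.
    return Path(directory) / f"{now.isoformat(timespec='minutes')}.log"


def run_and_log(command, directory):
    """Run `command` and write stdout and stderr to a new log file."""
    target = log_path(directory)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as log_file:
        # stderr=STDOUT corresponds to `2>&1` in the shell version.
        completed = subprocess.run(command, stdout=log_file, stderr=subprocess.STDOUT)
    return target, completed.returncode
```

A call such as `run_and_log(["uv", "run", "pytest", "tests/unit-tests"], "tests/artifacts/tmp_data")` returns the log path and the exit code. The file name contains colons, so this sketch is not suitable for file systems that disallow them.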

Overview of Code

Tree of Code

(last updated on 2026-05-29)

.
├── CONTRIBUTING.md
├── data            # contains all data produced by the application
│   ├── example_texts              # example texts by Arthur Conan Doyle, used for demonstration purposes
│   │   ├── ref
│   │   │   ├── acd004.txt
│   │   │   ├── ...
│   │   │   ...
│   │   └── tar
│   │       ├── acd001.txt
│   │       ├── ...
│   │       ...
│   ├── interim     # contains intermediate data not intended for end users
│   │   └── json    # NLP produces JSON data, which is saved here (via --save-nlp)
│   │       ├── ref
│   │       └── tar
│   └── results
│       ├── correlations_of_measures
│       │    └── heatmap.html                # visualization of the correlation between the results of different measures
│       └── scores                           # contains the results of the data analysis
│            ├── afn.csv                     # Contains the results of a measure using 'afn' as smartirs parameter
│            ├── afn.html                    # HTML visualization of the above result (open in a browser)
│            ├── bfn.csv
│            ├── bfn.html
│            ├── ...
│            ...
├── LICENSE
├── pyproject.toml      # contains all formal data that pip install and uv run need to know
├── pytest.ini          # config file for testing
├── README.md
├── requirements.txt    # explicit dependencies to read
├── scripts
│   ├── check_requirements.sh       # checks that requirements.txt is up to date
│   └── check_version_bump.sh       # checks that the version has been bumped
├── src                                    # the source code is saved here
│   └── pydistintox                    # executing package
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py                         # contains all command-line options
│       ├── common                         # package that contains all code that is used by multiple other packages
│       │   ├── __init__.py
│       │   ├── config.py                  # contains static objects used in several packages such as paths and classes
│       │   └── utils.py
│       ├── distinct_measures              # this package contains all calculations apart from the nlp done by gensim
│       │   ├── __init__.py
│       │   ├── config.py
│       │   ├── core.py                    # mainly these functions are exported
│       │   ├── load_matrices.py           # not used anymore but may be useful in future
│       │   ├── measures.py                # calculates 6 different statistical measures
│       │   └── utils.py
│       ├── main.py                        # this file is executed when running the application
│       ├── td_matrices                    # this package does the nlp, gives results of 6 measures (all done by gensim) and calculates term-document-matrices
│       │   ├── __init__.py
│       │   ├── config.py
│       │   ├── core.py
│       │   ├── create_matrices.py
│       │   ├── io_utils.py                # utilities for saving/loading NLP results as JSON
│       │   ├── parsing.py                 # functions for parsing texts; spacy is used here
│       │   ├── tfidf_measures.py          # execution of the tf-idf measures
│       │   └── utils.py
│       └── visualize
│           ├── __init__.py
│           ├── config.py
│           ├── core.py                    # visualization of correlation of measures and visualization of scores
│           └── utils.py
├── tests                           # using pytest package; pympler to measure used RAM
│   ├── artifacts                   # data of the tests
│   │   ├── memory                  # data of memory-tests
│   │   ├── performance             # data of performance tests
│   │   ├── sanity_data
│   │   ├── test_data
│   │   └── tmp_data                # used, for example, to test the creation of HTML files
│   ├── benchmark-tests             # not well implemented yet
│   ├── config.py
│   ├── conftest.py                 # fixtures used in several different kinds of tests
│   ├── integration-tests           # runs the application from start to finish
│   │   ├── conftest.py
│   │   ├── distinct_measures
│   │   └── td_matrices
│   ├── script-tests                # shell-based tests for the CLI and the self-parsed workflow
│   │   ├── test_cli.sh
│   │   └── test_self_parsed.sh
│   └── unit-tests                  # most functions have their own unit tests; used frequently
└── uv.lock