Contributing to PyDistintoX

Thank you for your interest in contributing to PyDistintoX. Whether you're here to report a bug, suggest a feature, or submit code, your help is greatly appreciated.

How Can I Contribute

Reporting Bugs

  • Use GitLab Issues to report bugs.
  • Describe the issue in as much detail as possible.

Pull Requests

  • Fork the repository and create a new branch for your changes.
  • Make sure all tests (see below) pass successfully.
  • Keep your commit messages clear and concise.

Code Guidelines

Imports

Imports are ordered in the following structure:

  • standard library imports, e.g. logging
  • third-party imports, e.g. numpy
  • application-specific imports, e.g. common.utils. Apart from typing, always import common.utils as common_utils in order to distinguish it from intra-package imports such as .utils
  • intra-package imports, i.e. relative imports such as .utils
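A hypothetical module header following this order could look like the sketch below. The non-standard-library imports are commented out so the sketch runs standalone; the helper function is illustrative, not part of the project.

```python
# standard library imports
import logging
from typing import Optional

# third-party imports
# import numpy as np

# application-specific imports
# import common.utils as common_utils  # aliased so it cannot be confused with .utils

# intra-package imports
# from .utils import tokenize  # hypothetical relative import

logger = logging.getLogger(__name__)


def coerce_level(level: Optional[int] = None) -> int:
    """Return the given log level, falling back to WARNING."""
    return logging.WARNING if level is None else level
```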

Styling

Try to follow PEP 8. Use ' for normal strings and " for docstrings only.
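As a small sketch of the quoting convention (the function name is hypothetical):

```python
def normalize_token(token: str) -> str:
    """Strip surrounding whitespace and lower-case a token."""  # docstrings use "
    return token.strip().lower()


label = 'distinctiveness'  # normal strings use '
```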

Imports

Use these headings for imports:

  • standard library imports
  • third-party imports
  • tests-specific imports
  • imports from source code
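A hypothetical test-module header using these headings could look like the following. Everything below the standard library section is commented out so the sketch runs standalone, and the commented names are placeholders, not real project paths.

```python
# standard library imports
from pathlib import Path

# third-party imports
# import pytest

# tests-specific imports
# from tests.config import SOME_SETTING  # hypothetical

# imports from source code
# from pydistintox.distinct_measures import core  # hypothetical

# mirrors the artifacts directory the tests write to
ARTIFACTS = Path('tests') / 'artifacts'
```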

Dependencies

To create requirements.txt, enter

uv pip compile pyproject.toml -o requirements.txt


Tests

In order to run tests, enter

uv run pytest tests/TEST_FOLDER

The test results will be saved in ./tests/artifacts. You can add the option -v to get verbose output, or -s to show print statements even if a test does not fail. Deselect slow tests with pytest -m "not slow" (recommended!).

For the moment, tests of the following kinds are possible (substitute for TEST_FOLDER above):

  • benchmark-tests: very slow; not well implemented yet
  • unit-tests: very fast; check the functionality of single functions
  • integration-tests: slow; run the application from start to finish
  • contrast-tests: fast; contrast the calculation in distinct_measures with the calculation of the pydistinto project
  • example-data-tests: slow; run end to end with the example data and contain some tests that check whether the mathematics in distinct_measures works as expected

Use --log-cli-level=verbose to change the logging mode to verbose.

Styling

  • Tests need unique names. Therefore, the naming convention for tests is the following:
test_[function_under_test]_[test_kind]_[test_scenario (optional)]

For example: test_get_measure_names_unit_specific_measures
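A minimal sketch of a unit test following this convention. The function under test here is a stand-in with made-up return values, not the real implementation:

```python
def get_measure_names(specific_only: bool = False) -> list[str]:
    # stand-in for the real function under test; values are illustrative
    names = ['afn', 'bfn', 'cfn']
    return names[:2] if specific_only else names


def test_get_measure_names_unit_specific_measures():
    # [function_under_test] = get_measure_names
    # [test_kind]           = unit
    # [test_scenario]       = specific_measures
    assert get_measure_names(specific_only=True) == ['afn', 'bfn']


test_get_measure_names_unit_specific_measures()  # pytest collects this by name automatically
```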

Command Line Options

The following benchmark-related options can be appended to the pytest command above:

      --benchmark-save=NAME \
      --benchmark-storage=./tests/artifacts/performance/ \
      --benchmark-histogram=tests/artifacts/performance/NAME \
      --benchmark-compare=TEST_NUMBER_TO_COMPARE \
      -m "not slow" \
      -k "not example"  # skip `slow` and also `example` tests

Tips and Tricks

  • If uv does not find pytest, delete your .venv and try again.
  • Module names need to be unique.
  • In order to save the log, use e.g. TEST-COMMAND > ./tests/artifacts/tmp_data/$(date -Iminutes).log 2>&1

Overview of Code

Tree of Code

(last updated on 2/23/2026)

.
├── CONTRIBUTOR.md 
├── data            # contains all data produced by the application
│   ├── interim     # contains intermediate data not intended for end users
│   │   ├── example # contains the result of nlp that is produced from the example data
│   │   └── json    # NLP produces JSON data, which is saved here
│   │       ├── ref
│   │       └── tar
│   ├── results
│   │   ├── correlations_of_measures
│   │   │    └── example 
│   │   │        ├── heatmap.png                 # a visualization of the correlation between the results of different measures
│   │   │        └── measure_correlations.csv    # the corresponding correlation matrix of the heatmap
│   │   └── scores                               # contains the results of the data analysis
│   │        └── example 
│   │             ├── afn.csv                    # Contains the results of a measure using 'afn' as smartirs parameter
│   │             ├── afn.html                   # HTML visualization of the above result (open in a browser) 
│   │             ├── bfn.csv                     
│   │             ├── bfn.html                    
│   │             ├── ...
│   │             ...
│   └── texts                      # place the data to be analyzed in these two directories
│       ├── corp_ref               # reference corpus directory
│       ├── corp_tar               # target corpus directory
│       └── example_texts          # example texts by Arthur Conan Doyle, used for demonstration purposes.
│           ├── corp_ref
│           │   ├── acd004.txt
│           │   ├── ...
│           │   ...
│           ├── corp_tar
│           │   ├── acd001.txt
│           │   ├── ...
│           │   ...
│           └── metadata.csv      # contains metadata for the example texts, such as author, title, and publication date.
├── LICENSE
├── pyproject.toml      # project metadata read by pip install and uv run
├── pytest.ini          # config file for testing
├── README.md
├── requirements.txt    # explicit list of dependencies, for reference
├── src                                    # the source code is saved here
│   └── pydistintox                    # the executable package
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py                         # contains all command-line options
│       ├── common                         # package that contains all code that is used by multiple other packages
│       │   ├── __init__.py
│       │   ├── config.py                  # contains static objects used in several packages, such as paths and classes
│       │   └── utils.py
│       ├── distinct_measures              # this package contains all calculations apart from the nlp done by gensim
│       │   ├── __init__.py
│       │   ├── config.py   
│       │   ├── core.py                    # mainly these functions are exported
│       │   ├── load_matrices.py           # not used anymore but may be useful in future
│       │   ├── measures.py                # calculates 6 different statistical measures
│       │   └── utils.py
│       ├── main.py                        # this file is executed when running the application
│       ├── td_matrices                    # this package does the nlp, gives results of 6 measures (all done by gensim) and calculates term-document-matrices
│       │   ├── __init__.py
│       │   ├── config.py
│       │   ├── core.py
│       │   ├── create_matrices.py     
│       │   ├── io_utils.py                # utilities for saving/loading NLP results as JSON
│       │   ├── parsing.py                 # functions for parsing texts; spacy is used here
│       │   ├── tfidf_measures.py          # execution of the tf-idf measures
│       │   └── utils.py
│       └── visualize
│           ├── __init__.py
│           ├── config.py
│           ├── core.py                    # visualization of correlation of measures and visualization of scores
│           └── utils.py
├── tests                           # uses the pytest package; mostly unit tests; pympler measures RAM usage
│   ├── artifacts                   # data of the tests
│   │   ├── memory                  # data of memory-tests
│   │   ├── performance             # data of performance tests
│   │   ├── test_data
│   │   └── tmp_data                # used e.g. for testing the creation of HTML files
│   ├── benchmark-tests             # not well implemented yet
│   ├── config.py                   
│   ├── conftest.py                 # fixtures used in several different kinds of tests
│   ├── contrast-tests              # contrasts pydistinto with pydistintox
│   ├── example-data-tests          # the most comprehensive test here. Uses the Doyle texts in data/example
│   ├── integration-tests           # not well implemented yet
│   ├── script-tests                # runs demo.py from docs/examples
│   └── unit-tests                  # most functions have their own unit tests; used frequently
└── uv.lock