Contributing to PyDistintoX

Thank you for your interest in contributing to PyDistintoX. Whether you're here to report a bug, suggest a feature, or submit code, your help is greatly appreciated.

How Can I Contribute

Reporting Bugs

  • Use GitLab Issues to report bugs.
  • Describe the issue in as much detail as possible.

Pull Requests

  • Fork the repository and create a new branch for your changes.
  • Make sure all tests (see below) pass successfully.
  • Keep your commit messages clear and concise.

Code Guidelines

Imports

Imports are ordered in the following structure:

  • standard library imports, e.g. logging
  • third-party imports, e.g. numpy
  • application-specific imports, e.g. common.utils. Apart from typing, always import common.utils as common_utils in order to distinguish it from intra-package imports such as .utils
  • intra-package imports, i.e. relative imports such as .utils
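A hypothetical module header following this order could look like the sketch below. The non-standard-library imports are commented out so the sketch runs standalone; the helper function is illustrative, not part of the project.

```python
# standard library imports
import logging
from typing import Optional

# third-party imports
# import numpy as np

# application-specific imports
# import common.utils as common_utils  # aliased so it cannot be confused with .utils

# intra-package imports
# from .utils import tokenize  # hypothetical relative import

logger = logging.getLogger(__name__)


def coerce_level(level: Optional[int] = None) -> int:
    """Return the given log level, falling back to WARNING."""
    return logging.WARNING if level is None else level
```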

Styling

Try to follow PEP 8. Use ' for normal strings and " for docstrings only.
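As a small sketch of the quoting convention (the function name is hypothetical):

```python
def normalize_token(token: str) -> str:
    """Strip surrounding whitespace and lower-case a token."""  # docstrings use "
    return token.strip().lower()


label = 'distinctiveness'  # normal strings use '
```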

Imports

Use these headings for imports:

  • standard library imports
  • third-party imports
  • tests-specific imports
  • imports from source code
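A hypothetical test-module header using these headings could look like the following. Everything below the standard library section is commented out so the sketch runs standalone, and the commented names are placeholders, not real project paths.

```python
# standard library imports
from pathlib import Path

# third-party imports
# import pytest

# tests-specific imports
# from tests.config import SOME_SETTING  # hypothetical

# imports from source code
# from pydistintox.distinct_measures import core  # hypothetical

# mirrors the artifacts directory the tests write to
ARTIFACTS = Path('tests') / 'artifacts'
```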

Dependencies

To create requirements.txt, enter

uv pip compile pyproject.toml -o requirements.txt


Tests

In order to run tests, enter

uv run pytest tests/TEST_FOLDER

The test results will be saved in ./tests/artifacts. You can add the option -v to get verbose output, or -s to show print statements even if a test does not fail. Deselect slow tests with pytest -m "not slow" (recommended!).

For the moment, tests of the following kinds are possible (substitute for TEST_FOLDER above):

  • benchmark-tests: very slow; not well implemented yet
  • unit-tests: very fast; check the functionality of single functions
  • integration-tests: slow; run the application from start to finish
  • contrast-tests: fast; contrast the calculation in distinct_measures with the calculation of the pydistinto project
  • example-data-tests: slow; run end to end with the example data and contain some tests that check whether the mathematics in distinct_measures works as expected

Use --log-cli-level=verbose to change the logging mode to verbose.

Styling

  • Tests need unique names. Therefore, the naming convention for tests is the following:
test_[function_under_test]_[test_kind]_[test_scenario (optional)]

For example: test_get_measure_names_unit_specific_measures
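A minimal sketch of a unit test following this convention. The function under test here is a stand-in with made-up return values, not the real implementation:

```python
def get_measure_names(specific_only: bool = False) -> list[str]:
    # stand-in for the real function under test; values are illustrative
    names = ['afn', 'bfn', 'cfn']
    return names[:2] if specific_only else names


def test_get_measure_names_unit_specific_measures():
    # [function_under_test] = get_measure_names
    # [test_kind]           = unit
    # [test_scenario]       = specific_measures
    assert get_measure_names(specific_only=True) == ['afn', 'bfn']


test_get_measure_names_unit_specific_measures()  # pytest collects this by name automatically
```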

Command Line Options

The following benchmark-related options can be appended to the pytest command above:

      --benchmark-save=NAME \
      --benchmark-storage=./tests/artifacts/performance/ \
      --benchmark-histogram=tests/artifacts/performance/NAME \
      --benchmark-compare=TEST_NUMBER_TO_COMPARE \
      -m "not slow" \
      -k "not example"  # skip `slow` and also `example` tests

Tips and Tricks

  • If uv does not find pytest, delete your .venv and try again.
  • Module names need to be unique.
  • In order to save the log, use e.g. TEST-COMMAND > ./tests/artifacts/tmp_data/$(date -Iminutes).log 2>&1

Overview of Code

Tree of Code

(last updated on 2/23/2026)

.
├── CONTRIBUTOR.md 
├── data            # contains all data produced by the application
│   ├── interim     # contains intermediate data not intended for end users
│   │   ├── example # contains the result of nlp that is produced from the example data
│   │   └── json    # NLP produces JSON data, which is saved here
│   │       ├── ref
│   │       └── tar
│   ├── results
│   │   ├── correlations_of_measures
│   │   │    └── example 
│   │   │        ├── heatmap.png                 # a visualization of the correlation between the results of different measures
│   │   │        └── measure_correlations.csv    # the corresponding correlation matrix of the heatmap
│   │   └── scores                               # contains the results of the data analysis
│   │        └── example 
│   │             ├── afn.csv                    # Contains the results of a measure using 'afn' as smartirs parameter
│   │             ├── afn.html                   # HTML visualization of the above result (open in a browser) 
│   │             ├── bfn.csv                     
│   │             ├── bfn.html                    
│   │             ├── ...
│   │             ...
│   └── texts                      # place the data to be analyzed in these two directories
│       ├── corp_ref               # reference corpus directory
│       ├── corp_tar               # target corpus directory
│       └── example_texts          # example texts by Arthur Conan Doyle, used for demonstration purposes.
│           ├── corp_ref
│           │   ├── acd004.txt
│           │   ├── ...
│           │   ...
│           ├── corp_tar
│           │   ├── acd001.txt
│           │   ├── ...
│           │   ...
│           └── metadata.csv      # contains metadata for the example texts, such as author, title, and publication date.
├── LICENSE
├── pyproject.toml      # project metadata read by pip install and uv run
├── pytest.ini          # config file for testing
├── README.md
├── requirements.txt    # explicit list of dependencies, for reference
├── src                                    # the source code is saved here
│   └── pydistintox                    # the executable package
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py                         # contains all command-line options
│       ├── common                         # package that contains all code that is used by multiple other packages
│       │   ├── __init__.py
│       │   ├── config.py                  # contains static objects used in several packages, such as paths and classes
│       │   └── utils.py
│       ├── distinct_measures              # this package contains all calculations apart from the nlp done by gensim
│       │   ├── __init__.py
│       │   ├── config.py   
│       │   ├── core.py                    # mainly these functions are exported
│       │   ├── load_matrices.py           # not used anymore but may be useful in future
│       │   ├── measures.py                # calculates 6 different statistical measures
│       │   └── utils.py
│       ├── main.py                        # this file is executed when running the application
│       ├── td_matrices                    # this package does the nlp, gives results of 6 measures (all done by gensim) and calculates term-document-matrices
│       │   ├── __init__.py
│       │   ├── config.py
│       │   ├── core.py
│       │   ├── create_matrices.py     
│       │   ├── io_utils.py                # utilities for saving/loading NLP results as JSON
│       │   ├── parsing.py                 # functions for parsing texts; spacy is used here
│       │   ├── tfidf_measures.py          # execution of the tf-idf measures
│       │   └── utils.py
│       └── visualize
│           ├── __init__.py
│           ├── config.py
│           ├── core.py                    # visualization of correlation of measures and visualization of scores
│           └── utils.py
├── tests                           # uses the pytest package; mostly unit tests; pympler measures RAM usage
│   ├── artifacts                   # data of the tests
│   │   ├── memory                  # data of memory-tests
│   │   ├── performance             # data of performance tests
│   │   ├── test_data
│   │   └── tmp_data                # used e.g. for testing the creation of HTML files
│   ├── benchmark-tests             # not well implemented yet
│   ├── config.py                   
│   ├── conftest.py                 # fixtures used in several different kinds of tests
│   ├── contrast-tests              # contrasts pydistinto with pydistintox
│   ├── example-data-tests          # the most comprehensive test here. Uses the Doyle texts in data/example
│   ├── integration-tests           # not well implemented yet
│   ├── script-tests                # runs demo.py from docs/examples
│   └── unit-tests                  # most functions have their own unit tests; used frequently
└── uv.lock