# Contributing to PyDistintoX
Thank you for your interest in contributing to PyDistintoX. Whether you're here to report a bug, suggest a feature, or submit code, your help is greatly appreciated.
## How Can I Contribute
### Reporting Bugs
- Use GitLab Issues to report bugs.
- Describe the issue in as much detail as possible.
### Pull Requests
- Fork the repository and create a new branch for your changes.
- Make sure all tests (see below) pass successfully.
- Keep your commit messages clear and concise.
## Code Guidelines
### Imports

Order imports in the following structure:
- standard library imports, e.g. `logging`
- third-party imports, e.g. `numpy`
- application-specific imports, e.g. `common.utils`
  - apart from `typing`, always import `common.utils` as `common_utils` in order to differentiate intra-package imports such as `.utils` from imports from `common.utils`
- intra-package imports, i.e. relative imports such as `.utils`
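As an illustration, a module header following this convention might look like the sketch below. It is not runnable on its own, and the imported names (`some_helper` in particular) are hypothetical; the exact relative paths depend on where the module lives inside `pydistintox`.

```python
# standard library imports
import logging
from typing import Optional

# third-party imports
import numpy as np

# application-specific imports
from common import utils as common_utils

# intra-package imports
from .utils import some_helper  # `some_helper` is a hypothetical name
```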
### Styling

Try to follow PEP 8.
Use `'` for normal strings, `"` for docstrings only.
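A minimal sketch of the quoting convention (`measure_label` is a hypothetical helper, not part of the codebase):

```python
def measure_label(name):
    """Return a display label for a measure."""  # docstring: double quotes
    return 'measure: ' + name  # normal string: single quotes
```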
#### Imports

Use these headings for imports:
- standard library imports
- third-party imports
- tests-specific imports
- imports from source code
## Dependencies

To create `requirements.txt`, enter

```shell
uv pip compile pyproject.toml -o requirements.txt
```
## Tests

To run tests, enter

```shell
uv run pytest tests/TEST_FOLDER
```

The test results are saved in `./tests/artifacts`.
You can add the `-v` option for verbose output, or `-s` to show print statements even if a test does not fail.
Deselect slow tests with `pytest -m "not slow"` (recommended!).
For the moment, tests of the following kinds are possible (substitute for `TEST_FOLDER` above):
- benchmark-tests
  - very slow
  - not well implemented yet
- unit-tests
  - very fast
  - check the functionality of single functions
- integration-tests
  - slow
  - run the application from start to finish
- contrast-tests
  - fast
  - contrast the calculation in `distinct_measures` with the calculation of the pydistinto project
- example-data-tests
  - slow
  - run end to end with the example data
  - contain some tests of whether the mathematics in `distinct_measures` works as expected
Use `--log-cli-level=verbose` to change the logging mode to verbose.
### Styling

Tests need unique names. Therefore, the naming convention for tests is the following:

`test_[function_under_test]_[test_kind]_[test_scenario (optional)]`

for example: `test_get_measure_names_unit_specific_measures`
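A sketch of how the example name above decomposes. The `get_measure_names` below is a simplified stand-in, not the real function from this repository; it exists only to make the naming convention concrete.

```python
# Hypothetical stand-in for the function under test.
def get_measure_names(specific=None):
    names = ['afn', 'bfn']
    if specific is None:
        return names
    return [n for n in names if n in specific]

# function_under_test = get_measure_names, test_kind = unit,
# test_scenario = specific_measures
def test_get_measure_names_unit_specific_measures():
    assert get_measure_names(specific=['afn']) == ['afn']
```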
### Command Line Options

```shell
--benchmark-save=NAME \
--benchmark-storage=./tests/artifacts/performance/ \
--benchmark-histogram=tests/artifacts/performance/NAME \
--benchmark-compare=TEST_NUMBER_TO_COMPARE \
-m "not slow" \
-k "not example"  # ignore `slow` and also `example` tests
```
## Tips and Tricks

- If `uv` does not find `pytest`, delete your `.venv` and try again.
- Module names need to be unique.
- In order to save the log, use e.g.

  ```shell
  TEST-COMMAND > ./tests/artifacts/tmp_data/$(date -Iminutes).log 2>&1
  ```
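The `$(date -Iminutes)` substitution expands to an ISO 8601 timestamp, so every run gets its own log file. A quick way to preview the resulting name (path taken from the tip above):

```shell
# Build a timestamped log-file name, as the tip above does.
# `date -Iminutes` is GNU date; the output looks like 2026-02-23T14:05+01:00.
logfile="./tests/artifacts/tmp_data/$(date -Iminutes).log"
echo "$logfile"
```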
## Overview of Code

### Tree of Code

(last updated on 2/23/2026)
```
.
├── CONTRIBUTOR.md
├── data                           # contains all data produced by the application
│   ├── interim                    # contains intermediate data not intended for end users
│   │   ├── example                # contains the result of the NLP produced from the example data
│   │   └── json                   # NLP produces JSON data, which is saved here
│   │       ├── ref
│   │       └── tar
│   ├── results
│   │   ├── correlations_of_measures
│   │   │   └── example
│   │   │       ├── heatmap.png    # a visualization of the correlation between the results of different measures
│   │   │       └── measure_correlations.csv  # the corresponding correlation matrix of the heatmap
│   │   └── scores                 # contains the results of the data analysis
│   │       └── example
│   │           ├── afn.csv        # contains the results of a measure using 'afn' as smartirs parameter
│   │           ├── afn.html       # HTML visualization of the above result (open in a browser)
│   │           ├── bfn.csv
│   │           ├── bfn.html
│   │           ├── ...
│   │           ...
│   └── texts                      # place the data to be analyzed in these two directories
│       ├── corp_ref               # reference corpus directory
│       ├── corp_tar               # target corpus directory
│       └── example_texts          # example texts by Arthur Conan Doyle, used for demonstration purposes
│           ├── corp_ref
│           │   ├── acd004.txt
│           │   ├── ...
│           │   ...
│           ├── corp_tar
│           │   ├── acd001.txt
│           │   ├── ...
│           │   ...
│           └── metadata.csv       # metadata for the example texts, such as author, title, and publication date
├── LICENSE
├── pyproject.toml                 # contains all formal data that pip install and uv run need to know
├── pytest.ini                     # config file for testing
├── README.md
├── requirements.txt               # explicit dependency list, for reading
├── src                            # the source code is saved here
│   └── pydistintox                # executing package
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py                 # contains all command-line options
│       ├── common                 # package containing all code that is used by multiple other packages
│       │   ├── __init__.py
│       │   ├── config.py          # static objects used in several packages, such as paths and classes
│       │   └── utils.py
│       ├── distinct_measures      # contains all calculations apart from the NLP done by gensim
│       │   ├── __init__.py
│       │   ├── config.py
│       │   ├── core.py            # mainly these functions are exported
│       │   ├── load_matrices.py   # not used anymore, but may be useful in the future
│       │   ├── measures.py        # calculates 6 different statistical measures
│       │   └── utils.py
│       ├── main.py                # this file is executed when running the application
│       ├── td_matrices            # does the NLP, gives results of 6 measures (all done by gensim) and calculates term-document matrices
│       │   ├── __init__.py
│       │   ├── config.py
│       │   ├── core.py
│       │   ├── create_matrices.py
│       │   ├── io_utils.py        # utilities for saving/loading NLP results as JSON
│       │   ├── parsing.py         # functions for parsing texts; spaCy is used here
│       │   ├── tfidf_measures.py  # execution of the tf-idf measures
│       │   └── utils.py
│       └── visualize
│           ├── __init__.py
│           ├── config.py
│           ├── core.py            # visualization of the correlation of measures and of the scores
│           └── utils.py
├── tests                          # uses pytest, mostly unit tests; pympler measures used RAM
│   ├── artifacts                  # data of the tests
│   │   ├── memory                 # data of memory tests
│   │   ├── performance            # data of performance tests
│   │   ├── test_data
│   │   └── tmp_data               # used e.g. for testing the creation of HTML files
│   ├── benchmark-tests            # not well implemented yet
│   ├── config.py
│   ├── conftest.py                # fixtures used in several different kinds of tests
│   ├── contrast-tests             # contrasts pydistinto with pydistintox
│   ├── example-data-tests         # the most comprehensive tests here; uses the Doyle texts in data/example
│   ├── integration-tests          # not well implemented yet
│   ├── script-tests               # runs demo.py from docs/examples
│   └── unit-tests                 # most functions have their own unit tests; used frequently
└── uv.lock
```