Synthetic data from PARSynthesizer does not follow original data distribution #2230

PaudGS · 2024-09-18T08:49:34Z

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

SDV version: 1.16.1
Python version: 3.11.9
Operating System: Windows

Problem description

Creating synthetic numeric values with PARSynthesizer returns values very close to the mean of the original distribution, with little variance between them.
The data is a simple table with four columns: patient_id (the sequence key), measure_id, measure_date_time (the sequence index), and the measurement value.
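
One way to put a number on the "little variance" symptom is to compare the spread of the synthetic column against the real one. The following is only a minimal sketch using the standard library: the helper names `summarize` and `spread_ratio` are hypothetical, and the two lists are made-up illustration values, not data from this issue.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def spread_ratio(real_values, synthetic_values):
    """Return synthetic std divided by real std.

    A ratio near 1 means a similar spread; a ratio near 0 means the
    synthetic values have collapsed toward a single value.
    """
    real_std = statistics.stdev(real_values)
    if real_std == 0:
        raise ValueError("real values have zero variance")
    return statistics.stdev(synthetic_values) / real_std


# Made-up numbers for illustration only (not taken from the issue).
real = [4.1, 7.8, 5.5, 9.2, 3.0, 6.4, 8.7, 2.9]
synthetic = [5.9, 6.0, 6.1, 5.8, 6.0, 5.9, 6.1, 6.0]

print("real:", summarize(real))
print("synthetic:", summarize(synthetic))
print("spread ratio:", spread_ratio(real, synthetic))
```

In practice the two lists would come from the value column of the real and sampled tables. Similar means combined with a small spread ratio would match the behaviour described above.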

The histograms of both distributions look like this:

What I already tried

I have tried different epoch values, a larger input dataset, and different RDT transformers.
Running the same data with the GaussianCopulaSynthesizer yields much better results, but I would lose the time series aspect of the original data.

Is this the expected behaviour of the PARSynthesizer or am I doing something wrong?

srinify · 2024-09-19T20:45:42Z

Hi @PaudGS 👋

At the moment, our single table and multi table synthesizers are definitely a bit more mature than PARSynthesizer, our sequential synthesizer. So this difference alone might be causing the shortcoming you're experiencing unfortunately, especially if you've already experimented with different epochs and different transformers.

To rule out a few more things, it would be helpful if you could share your metadata, the column(s) you care about most, and some sample values that represent the rough distribution (e.g. you can take your original values and scale them by a factor to add a layer of fuzziness). Oh, and also some more context on your use case in general!
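
The scaling idea could look something like the sketch below. It is an illustration only: `scale_values` and `coefficient_of_variation` are hypothetical helper names, the sample values are made up, and the factor range of 0.5 to 2.0 is an arbitrary assumption. Multiplying by one constant hides the true magnitudes while keeping the relative shape of the distribution.

```python
import random
import statistics


def scale_values(values, factor):
    """Multiply every value by the same positive factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [v * factor for v in values]


def coefficient_of_variation(values):
    """Return std / mean, which is unchanged by positive scaling."""
    return statistics.stdev(values) / statistics.fmean(values)


# Made-up measurements for illustration only.
original = [4.1, 7.8, 5.5, 9.2, 3.0]

# Pick a factor and keep it private; the seed is fixed only so the
# sketch is repeatable.
factor = random.Random(0).uniform(0.5, 2.0)
shared = scale_values(original, factor)

print("values safe to share:", shared)
print("shape preserved:",
      abs(coefficient_of_variation(original)
          - coefficient_of_variation(shared)) < 1e-9)
```

Note that a single scaling factor only obscures magnitudes; it is not a privacy guarantee, so it is best suited to sharing a rough picture of the distribution.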

This way, I can try to replicate the same distributions on my end, then suggest any possible improvements, and if needed we can document the shortcomings you encountered in a new issue for the team!

PaudGS added the new and question labels on Sep 18, 2024.

srinify self-assigned this on Sep 19, 2024.

srinify added the under discussion label and removed the new label on Sep 19, 2024.

srinify added the data:sequential label on Sep 19, 2024.
