Save fitted transformer object #141
Comments
Hi @belcheva, could you elaborate a little more? Do you mean that you're saving and restoring the DataTransformer so that you can re-use it between multiple training runs and/or between multiple calls of Also, note that the transformer does not run on the GPU, only the main model does.
Hi @fealho, sorry, the GPU was indeed irrelevant to this question. The idea is to be able to save and restore the DataTransformer between multiple calls of When you train a new model with different parameters you don’t have to call As far as I understand, it is a problem to save and load only the transformed data itself, because the DataTransformer properties and methods are used in I hope this is clearer now. I am open to any questions!
@csala what do you think? I can see the use of this for CTGAN, but I’m not sure if it’s applicable to SDV as well.
Well, the interesting thing is that this functionality is actually already implemented in SDV! In SDV all the tabular models fit a So my conclusion here is: yes, this is useful functionality. On the other hand, I do not think it is worth implementing this as a special feature, or even as part of CTGAN. Instead, what makes more sense is to end up decoupling the @belcheva I'm also curious, if possible: would you mind explaining what you are doing with CTGAN and what the use case is in which you are fitting multiple models with the same transformer?
@csala We use this to speed up parameter tuning and find out which architecture gives the best quality for our dataset.
Hi @belcheva, I've been interested in exactly the feature you've described, i.e., loading the fitted transformer data for later use. Could you make your code available?
Problem Description
When working with large datasets, fitting the transformer to the data takes a long time (for example, a sample of 500 000 rows and 1500 columns takes around 10 hours on an NVIDIA Quadro RTX 6000). Currently we use a custom feature that we built into the .fit() method of the CTGANSynthesizer class to save and load a fitted transformer object.
Maybe these adjustments could be useful for other users working with large-scale data?
I could prepare a PR in case this would be useful.
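To illustrate the idea, here is a minimal sketch of the "fit once, then reload" pattern. It does not use the actual CTGAN code: `ToyTransformer`, `load_or_fit`, and the cache file name are hypothetical stand-ins, and the toy min/max scaling only plays the role of the expensive fitting step. The sketch pickles the fitted attributes rather than the object itself, which is one possible design choice, not necessarily what our patch does.

```python
import os
import pickle
import tempfile


class ToyTransformer:
    """Hypothetical stand-in for a transformer that is slow to fit."""

    def __init__(self):
        self.fit_calls = 0
        self.minimum = None
        self.maximum = None

    def fit(self, values):
        # In the real use case this is the step that takes hours.
        self.fit_calls += 1
        self.minimum = min(values)
        self.maximum = max(values)

    def transform(self, values):
        # Scale to [0, 1]; fall back to 1.0 to avoid dividing by zero.
        span = (self.maximum - self.minimum) or 1.0
        return [(v - self.minimum) / span for v in values]


def load_or_fit(path, values):
    """Return a fitted transformer, reusing the state cached at `path` if present."""
    transformer = ToyTransformer()
    if os.path.exists(path):
        # Restore the fitted attributes instead of fitting again.
        with open(path, "rb") as handle:
            transformer.__dict__.update(pickle.load(handle))
        return transformer
    transformer.fit(values)
    with open(path, "wb") as handle:
        pickle.dump(transformer.__dict__, handle)
    return transformer


data = [2.0, 4.0, 6.0, 10.0]
cache_path = os.path.join(tempfile.mkdtemp(), "transformer_state.pkl")

first = load_or_fit(cache_path, data)   # fits and writes the cache
second = load_or_fit(cache_path, data)  # restores from disk, no refit
```

In a parameter search, each new model configuration would call `load_or_fit` with the same cache path, so only the first run pays the fitting cost. Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.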