Reproducibility issue with different splitters: every time I rerun the splitter I get different train/test scores, and the differences are significant. What can I do to reproduce the results? (#3897)
Comments
I can give you some suggestions; try the following:
> There is no data preprocessing. It just has the smiles and the task.
Hey @v-saini
Here's how you can modify your code:

```python
import pandas as pd
import deepchem as dc
import tempfile

# Read the CSV file once and store it as a global variable
df = pd.read_csv('file.csv')

# Create a temporary directory for saving the dataset and set the seed
tmpdir = tempfile.TemporaryDirectory()
seed = 2

# Set random seeds for the various components
dc.utils.set_random_seed(seed)

with dc.utils.UniversalNamedTemporaryFile(mode='w',
                                          directory=str(tmpdir.name),
                                          name='task1.csv') as tmpfile:
    df.to_csv(tmpfile.name)
    loader = dc.data.CSVLoader(["task1"], feature_field="smiles",
                               featurizer=dc.feat.ConvMolFeaturizer(per_atom_fragmentation=False))
    dataset = loader.create_dataset(tmpfile.name)

model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)

# Use the same seed for the splitter and the model
butinasplitter = dc.splits.ButinaSplitter()
train_dataset, test_dataset = butinasplitter.train_test_split(dataset,
                                                              frac_train=0.80,
                                                              seed=seed)
model.fit(train_dataset, nb_epoch=100)

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)

# Print the train and test set scores
print("Training set score:", model.evaluate(train_dataset, [metric]))
print("Test set score:", model.evaluate(test_dataset, [metric]))

# Clean up the temporary directory when done
# (TemporaryDirectory has cleanup(), not close())
tmpdir.cleanup()
```

Now, every time you run your code, it should give you consistent results.
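In case the one-call seeding utility used above is not available in your DeepChem version, the same idea can be sketched by hand: seed every source of randomness the pipeline touches. `set_global_seeds` below is a hypothetical helper, not part of DeepChem; it seeds TensorFlow only if it is installed, since `GraphConvModel` runs on the TF backend.

```python
import random

import numpy as np


def set_global_seeds(seed: int) -> None:
    """Seed the common sources of randomness by hand (fallback sketch)."""
    random.seed(seed)          # Python's built-in RNG
    np.random.seed(seed)       # NumPy, used by featurizers and splitters
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)  # weight initialization and dropout
    except ImportError:
        pass  # TF not installed; nothing more to seed


# Call this once, before featurization, splitting, and model construction.
set_global_seeds(2)
```

Seeding before model construction matters because the remaining run-to-run variance typically comes from weight initialization and dropout, not from the split.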
❓ Questions & Help
I am having trouble reproducing my results even though I have set the seed parameter. For example, in the code below, after generating the dataset, every time I rerun the splitter and fit the model I get quite different train and test results. I would not have bothered had the differences been small, but sometimes I get an R² score of 0.92 and sometimes 0.72. I have checked that the split is the same every time, but the graph objects are different. What can I do to ensure reproducibility?
```python
df = pd.read_csv('file.csv')

with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
    df.to_csv(tmpfile.name)
    loader = dc.data.CSVLoader(["task1"], feature_field="smiles",
                               featurizer=dc.feat.ConvMolFeaturizer(per_atom_fragmentation=False))
    dataset = loader.create_dataset(tmpfile.name)

model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)
butinasplitter = dc.splits.ButinaSplitter()
train_dataset, test_dataset = butinasplitter.train_test_split(dataset, frac_train=0.80, seed=2)
model.fit(train_dataset, nb_epoch=100)

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print("Training set score:", model.evaluate(train_dataset, [metric]))
print("Test set score:", model.evaluate(test_dataset, [metric]))
```
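The claim that "the split is the same every time" can be checked mechanically by comparing a stable signature of the split ids across runs. `split_signature` below is a hypothetical helper, not part of DeepChem:

```python
import hashlib


def split_signature(ids) -> str:
    # The same ids in the same order always yield the same digest,
    # so equal digests across runs mean the split is unchanged.
    joined = "\n".join(str(i) for i in ids)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Usage (assuming DeepChem datasets, whose .ids attribute holds the
# identifiers of the samples in each partition):
#   print(split_signature(train_dataset.ids))
#   print(split_signature(test_dataset.ids))
```

If the digests match across runs but the scores still differ, the variance is coming from featurization or model training rather than the splitter.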