CrabNet sometimes ignores/skips certain compounds. Why? How to keep track of compound IDs? #13

Open
sgbaird opened this issue Sep 15, 2021 · 4 comments
Labels
question Further information is requested

Comments

@sgbaird
Collaborator

sgbaird commented Sep 15, 2021

Say there are 10,000 validation compounds and only ~9900 get pushed to the validation results. First, is this likely because of repeated compounds or because certain compounds are "invalid"? Second, how can I keep track of the "ID" of each compound (in my case, a Materials Project task_id)?

@anthony-wang
Owner

It's most likely due to single-element compounds, which are dropped by default when loading the featurized EDM. You can find the code here: https://github.com/anthony-wang/CrabNet/blob/master/utils/utils.py#L453

Also, you can check whether you have duplicates in the formulae: https://github.com/anthony-wang/CrabNet/blob/master/utils/composition.py#L206

The generate_features function returns a list of skipped formulae, so you can also look at that to see what was skipped.

I'm not sure how to keep track of the MP task_id of each compound; I think that would depend on how you are storing them. Why not store them separately?
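To see which of the two causes above explains a missing-row count, a rough check like the following may help. It is only a sketch: the `records` list and its IDs are made up, `n_elements` is a hypothetical helper (not part of CrabNet), and the regex is a naive element-symbol match that does no validation.

```python
import re
from collections import defaultdict

# Hypothetical (task_id, formula) pairs; the IDs are invented for illustration.
records = [
    ("mp-1", "Fe2O3"),
    ("mp-2", "Si"),
    ("mp-3", "Fe2O3"),
    ("mp-4", "NaCl"),
]


def n_elements(formula):
    """Count distinct element symbols with a naive regex (no validation)."""
    return len(set(re.findall(r"[A-Z][a-z]?", formula)))


# Single-element entries, which the reply above says are dropped by default.
single_element = [(tid, f) for tid, f in records if n_elements(f) == 1]

# Formulae that appear more than once, with every task_id that maps to them.
ids_by_formula = defaultdict(list)
for tid, f in records:
    ids_by_formula[f].append(tid)
duplicates = {f: ids for f, ids in ids_by_formula.items() if len(ids) > 1}

print(single_element)
print(duplicates)
```

Keeping the task_id next to the formula in a separate structure like this means the IDs survive regardless of what the featurization step drops.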

@anthony-wang anthony-wang added the question Further information is requested label Sep 16, 2021
@sgbaird
Collaborator Author

sgbaird commented Sep 19, 2021

Found a workaround via https://stackoverflow.com/questions/22407798/how-to-reset-a-dataframes-indexes-for-all-groups-in-one-step, related to:

df = df.groupby(by='formula').mean().reset_index() # mean of duplicates

df = df.groupby(by='formula', as_index=False).mean()  # mean of duplicates

Though I'm not sure whether this breaks anything.

@sgbaird
Collaborator Author

sgbaird commented Sep 19, 2021

Actually, maybe something more like this is what I should be looking for:

df = (
    df.reset_index()
    .groupby(by="formula")
    .agg({"index": lambda x: tuple(x), "target": "mean"})
    .reset_index()
)

https://stackoverflow.com/questions/49216357/how-to-keep-original-index-of-a-dataframe-after-groupby-2-columns
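For reference, here is a small self-contained version of that aggregation on toy data. It is a sketch under assumptions: the `task_id` column and its values are made up, and it aggregates an explicit `task_id` column rather than the reset index, which should behave the same way.

```python
import pandas as pd

# Toy data; the task_id values are invented for illustration.
df = pd.DataFrame(
    {
        "task_id": ["mp-1", "mp-2", "mp-3"],
        "formula": ["Fe2O3", "NaCl", "Fe2O3"],
        "target": [1.0, 3.0, 2.0],
    }
)

# One row per formula: average the target and keep every contributing ID.
grouped = (
    df.groupby(by="formula")
    .agg({"task_id": lambda x: tuple(x), "target": "mean"})
    .reset_index()
)

print(grouped)
```

Each row of `grouped` then carries a tuple of all IDs that were averaged into it, so a prediction for a formula can be traced back to its original entries.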

@sgbaird
Collaborator Author

sgbaird commented Sep 19, 2021

I'm not sure how to factor dropping pure elements into this, other than by simply dropping them without tracking anything. Maybe by modifying the groupby args.
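One possible way to handle this, sketched here as an assumption rather than anything CrabNet does internally, is to split off the pure elements before the groupby and keep their IDs in a separate frame. The column names, the IDs, and the naive element-symbol regex are all hypothetical.

```python
import pandas as pd

# Toy data; the task_id values are invented for illustration.
df = pd.DataFrame(
    {
        "task_id": ["mp-1", "mp-2", "mp-3", "mp-4"],
        "formula": ["Fe2O3", "Si", "Fe2O3", "NaCl"],
        "target": [1.0, 5.0, 2.0, 3.0],
    }
)

# Count distinct element symbols per formula (naive regex, no validation).
n_elem = df["formula"].str.findall(r"[A-Z][a-z]?").map(lambda s: len(set(s)))
is_pure = n_elem == 1

# Record what gets dropped instead of discarding it silently.
dropped = df.loc[is_pure, ["task_id", "formula"]]
kept = df.loc[~is_pure]

# Same duplicate handling as before, applied only to the kept rows.
grouped = (
    kept.groupby(by="formula")
    .agg({"task_id": lambda x: tuple(x), "target": "mean"})
    .reset_index()
)

print(dropped)
print(grouped)
```

With this split, `dropped` accounts for the pure elements and `grouped` for the merged duplicates, so the two together should explain any difference between the input and output row counts.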
