CrabNet sometimes ignores/skips certain compounds. Why? How to keep track of compound IDs? #13

sgbaird · 2021-09-15T23:47:06Z

Say there are 10,000 validation compounds and only ~9900 get pushed to the validation results. First, is this likely because of repeated compounds or because certain compounds are "invalid"? Second, how can I keep track of the "ID" of each compound (in my case, a Materials Project task_id)?


anthony-wang · 2021-09-16T05:50:59Z

It's most likely due to single-element compounds, which are dropped by default when loading the featurized EDM. You can find the code here: https://github.com/anthony-wang/CrabNet/blob/master/utils/utils.py#L453

Also, you can check if you have duplicates in the formulae: https://github.com/anthony-wang/CrabNet/blob/master/utils/composition.py#L206

The generate_features function returns a list of skipped formulae, so you can look at that too to see what's skipped.

I'm not sure how to keep track of the MP task_id of each compound; I think that would depend on how you are storing them. Why not store and keep them separately?
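Both causes can be checked on a DataFrame before it goes anywhere near the model. The following is a minimal sketch that does not use CrabNet's own loading code: the `formula` column name, the toy data, and the `count_elements` helper are assumptions for illustration, and the regex only counts distinct element symbols in a plain formula string.

```python
import re

import pandas as pd


def count_elements(formula: str) -> int:
    """Count distinct element symbols (hypothetical helper, not part of CrabNet)."""
    return len(set(re.findall(r"[A-Z][a-z]?", formula)))


# Toy validation set; a real one would have ~10,000 rows.
df = pd.DataFrame(
    {
        "formula": ["Al2O3", "Fe", "Al2O3", "NaCl", "O2", "SiO2"],
        "target": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# Rows that repeat an earlier formula (duplicates are averaged into one row).
is_repeat = df["formula"].duplicated(keep="first")
# Rows whose formula contains only one element (dropped by default).
is_pure = df["formula"].map(count_elements) == 1

# Unique multi-element formulae: the row count to compare against the results.
n_unique_multi = df.loc[~is_pure, "formula"].nunique()

print(f"repeated rows: {is_repeat.sum()}")
print(f"single-element rows: {is_pure.sum()}")
print(f"unique multi-element formulae: {n_unique_multi}")
```

If the last number matches the size of the validation results, duplicates and pure elements account for the whole gap.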

sgbaird · 2021-09-19T01:56:52Z

Found a workaround via https://stackoverflow.com/questions/22407798/how-to-reset-a-dataframes-indexes-for-all-groups-in-one-step, related to line 455 of CrabNet/utils/utils.py (at commit 9e0d79c):

df = df.groupby(by='formula').mean().reset_index() # mean of duplicates

The workaround:

df = df.groupby(by='formula', as_index=False).mean()  # mean of duplicates

Though I'm not sure if this breaks anything.
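One way to test whether the change breaks anything is to compare the two forms on a small frame. The sketch below uses made-up data with only numeric non-key columns; the column names are assumptions, not necessarily what CrabNet loads.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "formula": ["NaCl", "Al2O3", "NaCl", "SiO2"],
        "target": [1.0, 2.0, 3.0, 4.0],
    }
)

# Original form: formula becomes the index, then is moved back to a column.
a = df.groupby(by="formula").mean().reset_index()
# Workaround form: formula stays a column from the start.
b = df.groupby(by="formula", as_index=False).mean()

# Raises AssertionError if the results differ in values, dtypes, or index.
pd.testing.assert_frame_equal(a, b)
print(b)
```

Both forms discard the original row index, so neither one records which rows were merged. How `.mean()` treats a non-numeric column such as a task_id string depends on the pandas version, so it is safer to aggregate such columns explicitly.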

sgbaird · 2021-09-19T02:25:29Z

Actually, maybe something more like this is what I should be looking for:

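df = (
    df.reset_index()  # expose the original (unnamed) row index as an "index" column
    .groupby(by="formula")
    # per formula: collect the original indices, average the duplicate targets
    .agg({"index": lambda x: tuple(x), "target": "mean"})
    .reset_index()
)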

https://stackoverflow.com/questions/49216357/how-to-keep-original-index-of-a-dataframe-after-groupby-2-columns

sgbaird · 2021-09-19T05:04:52Z

Not sure how to factor dropping pure elements into this, other than by simply dropping them without tracking anything. Maybe by modifying groupby args.
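One possible way to factor in the pure elements is to split them off before grouping and record their IDs at that point. This is a minimal sketch done outside CrabNet, not a change to its loader: the `task_id` column, the toy values, and the `is_pure_element` helper are all assumptions for illustration.

```python
import re

import pandas as pd


def is_pure_element(formula: str) -> bool:
    """True if the formula has one distinct element (hypothetical helper)."""
    return len(set(re.findall(r"[A-Z][a-z]?", formula))) == 1


# Toy data; the task_id values are made up for illustration.
df = pd.DataFrame(
    {
        "task_id": ["mp-1", "mp-2", "mp-3", "mp-4", "mp-5"],
        "formula": ["Al2O3", "Fe", "Al2O3", "NaCl", "O2"],
        "target": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

pure = df["formula"].map(is_pure_element)

# Record the IDs of single-element rows before they are dropped.
dropped_ids = tuple(df.loc[pure, "task_id"])

# Group the remaining rows, keeping every contributing task_id per formula.
grouped = (
    df.loc[~pure]
    .groupby(by="formula")
    .agg(
        task_ids=("task_id", lambda ids: tuple(ids)),
        target=("target", "mean"),
    )
    .reset_index()
)

print(dropped_ids)
print(grouped)
```

Each row of `grouped` carries all task_ids that were averaged into it, and `dropped_ids` lists the compounds that never reach the model.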

anthony-wang added the "question" label (Further information is requested) on Sep 16, 2021
