
metadata

license: mit

This is a refined version of a dataset obtained from UniProt (see here). The data was first sorted by family, then random families were selected until approximately 20% of the data was separated out as test data. Next, each sequence longer than 1000 residues was segmented into non-overlapping sections of 1000 amino acids or fewer. Sequences with only partial binding site annotations (those whose annotations contain <, >, or ?) were discarded.
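The preprocessing described above could be sketched roughly as follows. This is an illustration only, not the script used to build the dataset: the record layout (dicts with a `family` key), the function names, the random seed, and the idea that annotations are available as plain strings are all assumptions made for the example.

```python
import random

MAX_LEN = 1000  # maximum segment length stated in the description


def has_partial_annotation(annotation: str) -> bool:
    """Return True if an annotation string contains '<', '>', or '?'.

    Assumption: the binding site annotation is available as a plain string.
    """
    return any(ch in annotation for ch in "<>?")


def split_by_family(records, test_fraction=0.2, seed=0):
    """Assign whole families to the test set until it holds about
    test_fraction of all records; the remaining families go to train.

    Assumption: each record is a dict with a 'family' key (hypothetical layout).
    """
    by_family = {}
    for rec in records:
        by_family.setdefault(rec["family"], []).append(rec)

    families = sorted(by_family)              # sort by family first
    random.Random(seed).shuffle(families)     # then pick families at random

    target = test_fraction * len(records)
    train, test = [], []
    for fam in families:
        # Keep filling the test set until the target size is reached.
        (test if len(test) < target else train).extend(by_family[fam])
    return train, test


def segment(sequence, max_len=MAX_LEN):
    """Yield non-overlapping chunks of at most max_len items.

    Works on any sliceable object, so the same call can be applied to a
    residue string and to a per-residue label list of the same length.
    """
    for start in range(0, len(sequence), max_len):
        yield sequence[start:start + max_len]
```

Because families are assigned as a whole, no family appears in both splits, and the test share only approximates 20% since it depends on family sizes. A 2500-residue sequence passed to `segment` yields chunks of 1000, 1000, and 500 residues.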

Note: Copied from https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family