license: mit

This is a refined version of a dataset obtained from UniProt (see here). The data was first sorted by family, then random families were selected until approximately 20% of the data was separated out as test data. Next, each sequence longer than 1000 residues was segmented into non-overlapping sections of 1000 amino acids or fewer. Any sequence with only partial binding site annotations (that is, any sequence whose annotations contain <, >, or ?) was discarded.
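
The preprocessing described above can be sketched roughly as follows. This is a minimal illustration and not the original processing script: the record layout (dicts with `family`, `sequence`, and `labels` keys), the per-residue label list, and the helper names `split_by_family`, `has_partial_annotation`, and `segment` are all assumptions made for the example.

```python
import random

MAX_LEN = 1000  # maximum segment length stated in the description above


def split_by_family(records, test_fraction=0.2, seed=0):
    """Assign whole families to the test set until ~test_fraction is reached.

    `records` is assumed to be a list of dicts with a 'family' key
    (hypothetical layout, not taken from the original data files).
    """
    families = {}
    for rec in records:
        families.setdefault(rec["family"], []).append(rec)

    names = sorted(families)              # sort by family
    random.Random(seed).shuffle(names)    # then pick families at random

    target = test_fraction * len(records)
    train, test = [], []
    for name in names:
        # Families are never divided between the two splits.
        if len(test) < target:
            test.extend(families[name])
        else:
            train.extend(families[name])
    return train, test


def has_partial_annotation(annotation):
    """True if an annotation string contains '<', '>', or '?'."""
    return any(ch in annotation for ch in "<>?")


def segment(sequence, labels, max_len=MAX_LEN):
    """Cut a sequence and its per-residue labels into non-overlapping chunks."""
    return [
        (sequence[i:i + max_len], labels[i:i + max_len])
        for i in range(0, len(sequence), max_len)
    ]
```

Because whole families are assigned to one side, the test share only approximates 20%, which matches the "approximately" in the description.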

Note: Copied from https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family