File fixes and cleaning (#17)
- Add changes (23add198b9aed3461771ec64c740e7c2f6789dd1)
- Add info about the changes in the markdown. (4a1e5cc01386ce466b5172d77f8d97e0792609f9)
- contamination_report.csv +0 -0
- dataset.py +2 -1
- markdown.py +2 -1
- postprocessing.py +43 -0
contamination_report.csv
CHANGED
The diff for this file is too large to render.
dataset.py
CHANGED
@@ -256,7 +256,7 @@ def get_dataframe():
     # For "Contaminated Source" use build_dataset_url if "Model or corpus" is "corpus" and build_model_url if "Model or corpus" is "model"
     data["Contaminated Source"] = data.apply(
         lambda x: build_text_icon(
-            text=x["Contaminated Source"],
+            text=x["Contaminated Source"] + f" ({x['Version']})" if pd.notna(x["Version"]) else x["Contaminated Source"],
             url=dataset_url_dict.get(x["Contaminated Source"], "")
             if x["Model or corpus"] == "corpus"
             else model_url_dict.get(x["Contaminated Source"], ""),
@@ -264,6 +264,7 @@ def get_dataframe():
         ),
         axis=1,
     )
+    del data["Version"]
 
     data["Train Split"] = data["Train Split"].apply(lambda x: x/100 if x else x)
     data["Development Split"] = data["Development Split"].apply(lambda x: x/100 if x else x)
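For context, here is a minimal, self-contained sketch of what this change does. `build_text_icon`, both URL dicts, and all values below are simplified stand-ins for the real objects in dataset.py, and the version string is hypothetical:

import pandas as pd

# Simplified stand-in for dataset.py's build_text_icon helper.
def build_text_icon(text, url):
    return f"[{text}]({url})" if url else text

dataset_url_dict = {"glue": "https://huggingface.co/datasets/glue"}
model_url_dict = {"allenai/OLMo-7B": "https://huggingface.co/allenai/OLMo-7B"}

data = pd.DataFrame({
    "Contaminated Source": ["glue", "allenai/OLMo-7B"],
    "Model or corpus": ["corpus", "model"],
    "Version": [None, "v1.0"],  # hypothetical version value
})

# Same expression as the diff: append " (<Version>)" to the displayed text
# only when a Version is present; the URL lookup still uses the raw name.
data["Contaminated Source"] = data.apply(
    lambda x: build_text_icon(
        text=x["Contaminated Source"] + f" ({x['Version']})" if pd.notna(x["Version"]) else x["Contaminated Source"],
        url=dataset_url_dict.get(x["Contaminated Source"], "")
        if x["Model or corpus"] == "corpus"
        else model_url_dict.get(x["Contaminated Source"], ""),
    ),
    axis=1,
)
del data["Version"]  # Version is only displayed, not kept as its own column

print(data["Contaminated Source"].tolist())
# ['[glue](https://huggingface.co/datasets/glue)',
#  '[allenai/OLMo-7B (v1.0)](https://huggingface.co/allenai/OLMo-7B)']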
markdown.py
CHANGED
@@ -60,8 +60,9 @@ Citation: `@inproceedings{...`
 
 The [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database/blob/main/contamination_report.csv) file is a csv file with `;` delimiters. You will need to update the following columns:
 - **Evaluation Dataset**: Name of the evaluation dataset that has (not) been compromised. If available on the HuggingFace Hub, please write the path (e.g. `uonlp/CulturaX`); otherwise, provide the name of the dataset.
-- **Subset**: Many HuggingFace datasets have different subsets or splits within a single dataset. This field defines a particular subset of a given dataset. For example, the `qnli` subset of `glue`.
+- **Subset**: (Optional) Many HuggingFace datasets have different subsets or splits within a single dataset. This field defines a particular subset of a given dataset. For example, the `qnli` subset of `glue`.
 - **Contaminated Source**: Name of the model that has been trained with the evaluation dataset, or name of the pre-training corpora that contains the evaluation dataset. If available on the HuggingFace Hub, please write the path (e.g. `allenai/OLMo-7B`); otherwise, provide the name of the model/dataset.
+- **Version**: (Optional) Any information relevant to identify the version of the model or dataset. This information will be shown between parentheses in the Contaminated Source column.
 - **Train split**: Percentage of the train split contaminated. 0 means no contamination. 100 means that the dataset has been fully compromised. If the dataset doesn't have splits, you can consider the full dataset to be a train or test split.
 - **Development split**: Percentage of the development split contaminated. 0 means no contamination. 100 means that the dataset has been fully compromised.
 - **Test split**: Percentage of the test split contaminated. 0 means no contamination. 100 means that the dataset has been fully compromised. If the dataset doesn't have splits, you can consider the full dataset to be a train or test split.
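To make the documented column order concrete, here is a hypothetical example row (all values invented for illustration; trailing columns after Test split are elided):

Evaluation Dataset;Subset;Contaminated Source;Version;Train split;Development split;Test split;...
glue;qnli;allenai/OLMo-7B;v1.0;0;0;5;...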
postprocessing.py
ADDED
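The new script below normalizes contamination_report.csv in place: it sorts the rows, drops duplicate entries, rewrites arXiv PDF links to abstract links, and inserts an empty value for the new Version column into every existing row, which is presumably why the csv diff above is too large to render.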
def load_file(filename):
    # Read the ;-delimited report: the first line is the header, the rest are
    # rows; blank lines (used as visual group separators) are skipped.
    with open(filename, 'r') as f:
        header = f.readline().strip().split(";")
        return header, [line.strip().split(";") for line in f if line.strip()]

def remove_duplicates(data):
    # Drop duplicate rows, keyed on the first four columns plus the last one,
    # keeping the first occurrence.
    keys = set()
    _data = []
    for item in data:
        key = tuple((item[0], item[1], item[2], item[3], item[-1]))
        if key in keys:
            continue
        _data += [item]
        keys.add(key)
    return _data

def fix_arxiv_links(data):
    # Point arXiv links in the second-to-last column at the abstract page
    # instead of the PDF.
    return [[*item[:-2], item[-2].replace("arxiv.org/pdf", "arxiv.org/abs"), item[-1]] for item in data]

def sort_data(data):
    return sorted(data, key=lambda x: (x[0], x[1], x[2], x[3], x[-1]))

def main():
    header, data = load_file("contamination_report.csv")
    data = sort_data(data)
    data = remove_duplicates(data)
    data = fix_arxiv_links(data)
    print("Total datapoints:", len(data))

    with open("contamination_report.csv", 'w') as f:
        f.write(";".join(header) + "\n")
        past_key = None
        for line in data:
            # Separate (Evaluation Dataset, Subset) groups with a blank line.
            key = tuple((line[0], line[1]))
            if key != past_key:
                f.write("\n")
                past_key = key
            # Insert an empty value for the new "Version" column (index 3,
            # right after "Contaminated Source") into every existing row.
            line = line[:3] + [""] + line[3:]
            f.write(";".join(line) + "\n")


if __name__ == "__main__":
    main()
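The script takes no arguments and needs only the standard library; run it from the directory containing contamination_report.csv:

python postprocessing.py

Note that the empty Version field is inserted unconditionally, so this reads as a one-shot migration: running it a second time would insert a second empty column into every row.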