LLM360/TxT360

victormiller committed on Sep 26, 2024

Commit 146aa07 (verified) · 1 parent: 466af30

Update web.py

Files changed (1): web.py

@@ -216,6 +216,8 @@ def web_data():
             style="margin-top: 20px;",
         ),
         H3("1. Document Preparation"),
+        button( Div(
         H4("1.1 Text Extraction"),
         P("""
         Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
@@ -224,7 +226,8 @@ def web_data():
         we found WET files to include boilerplate content like navigation menus, ads, and other irrelevant texts.
         Accordingly, our pipeline starts from raw WARC files, reads with the warcio library, and extracts texts using trafilatura.
         """),
-        DV2("data/sample_wet.json", "data/sample_warc.json", 3),
+        DV2("data/sample_wet.json", "data/sample_warc.json", 3),), cls="collapsible"),
         H4("1.2 Language Identification"),
         P("""
         After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
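
Note on the language-identification step described in the page text above: the 0.65 threshold can be sketched as a small filter. This is a minimal sketch, not the pipeline's actual code. The names `is_english` and `filter_english` are hypothetical, the `>=` comparison is an assumption (the text only says "a threshold of 0.65"), and the predictor is passed in as a callable so the logic can be checked without a model file. The commented wiring assumes the `fasttext` Python package and the publicly distributed `lid.176.bin` model, neither of which is named in the diff.

```python
from typing import Callable, List, Sequence, Tuple

# A predictor maps one line of text to (labels, probabilities), which is the
# shape returned by fastText's model.predict(text, k=1).
Prediction = Tuple[Sequence[str], Sequence[float]]
Predictor = Callable[[str], Prediction]


def is_english(
    text: str,
    predict: Predictor,
    threshold: float = 0.65,      # value quoted in the page text
    label: str = "__label__en",   # fastText's label format for English
) -> bool:
    """Return True if the top predicted label is English with enough confidence."""
    # fastText predicts one line at a time, so collapse all whitespace,
    # including newlines, into single spaces first.
    single_line = " ".join(text.split())
    if not single_line:
        return False
    labels, probs = predict(single_line)
    if len(labels) == 0:
        return False
    # Assumption: a score exactly at the threshold is kept.
    return labels[0] == label and float(probs[0]) >= threshold


def filter_english(
    docs: List[str], predict: Predictor, threshold: float = 0.65
) -> List[str]:
    """Keep only the documents that pass the English check."""
    return [doc for doc in docs if is_english(doc, predict, threshold)]


# Wiring in a real model (assumes `pip install fasttext` and a local lid.176.bin):
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   english_docs = filter_english(docs, lambda s: model.predict(s, k=1))
```

Passing the predictor as an argument keeps the threshold logic separate from model loading, so the same filter can be exercised with a stub in tests.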