Update web.py
web.py
CHANGED
@@ -216,6 +216,8 @@ def web_data():
 style="margin-top: 20px;",
 ),
 H3("1. Document Preparation"),
+
+button( Div(
 H4("1.1 Text Extraction"),
 P("""
 Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
@@ -224,7 +226,8 @@ def web_data():
 we found WET files to include boilerplate content like navigation menus, ads, and other irrelevant texts.
 Accordingly, our pipeline starts from raw WARC files, reads with the warcio library, and extracts texts using trafilatura.
 """),
-DV2("data/sample_wet.json", "data/sample_warc.json", 3),
+DV2("data/sample_wet.json", "data/sample_warc.json", 3),), cls="collapsible"),
+
 H4("1.2 Language Identification"),
 P("""
 After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.