2025-06-04 18:24:36,828 - WARNING - Using default email for Entrez. Set ENTREZ_EMAIL environment variable. | |
2025-06-04 18:24:36,828 - INFO - Starting scientific corpus build... | |
2025-06-04 18:24:36,828 - INFO - Fetching arXiv papers... | |
2025-06-04 18:24:36,828 - INFO - Starting arXiv paper collection... | |
2025-06-04 18:24:36,828 - INFO - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100 | |
2025-06-04 18:24:37,582 - INFO - Got first page: 10 of 1234847 total results | |
2025-06-04 18:24:37,584 - INFO - Sleeping: 2.891590 seconds | |
2025-06-04 18:24:40,480 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=10&max_results=100 | |
2025-06-04 18:24:41,497 - INFO - Sleeping: 2.834623 seconds | |
2025-06-04 18:24:44,337 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=110&max_results=100 | |
2025-06-04 18:24:45,793 - INFO - Sleeping: 2.891338 seconds | |
2025-06-04 18:24:48,693 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=210&max_results=100 | |
2025-06-04 18:24:50,445 - INFO - Sleeping: 2.889250 seconds | |
2025-06-04 18:24:53,340 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=310&max_results=100 | |
2025-06-04 18:24:55,121 - INFO - Sleeping: 2.853944 seconds | |
2025-06-04 18:24:57,990 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=410&max_results=100 | |
2025-06-04 18:24:59,446 - INFO - Sleeping: 2.894095 seconds | |
2025-06-04 18:25:02,352 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=510&max_results=100 | |
2025-06-04 18:25:04,142 - INFO - Sleeping: 2.883096 seconds | |
2025-06-04 18:25:07,027 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=610&max_results=100 | |
2025-06-04 18:25:09,682 - INFO - Sleeping: 2.885210 seconds | |
2025-06-04 18:25:12,569 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=710&max_results=100 | |
2025-06-04 18:25:13,307 - INFO - Sleeping: 2.999002 seconds | |
2025-06-04 18:25:16,312 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=710&max_results=100 | |
2025-06-04 18:25:16,656 - INFO - Sleeping: 2.999005 seconds | |
2025-06-04 18:25:19,670 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=710&max_results=100 | |
2025-06-04 18:25:20,665 - INFO - Sleeping: 2.998205 seconds | |
2025-06-04 18:25:23,672 - INFO - Requesting page (first: False, try: 3): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=710&max_results=100 | |
2025-06-04 18:25:24,411 - WARNING - Empty page returned for query 'cat:physics* OR cat:astro-ph* OR cat:cond-mat* OR cat:hep-th OR cat:quant-ph OR cat:math-ph': Page of results was unexpectedly empty (https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=710&max_results=100) | |
2025-06-04 18:25:24,413 - INFO - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100 | |
2025-06-04 18:25:25,229 - INFO - Got first page: 100 of 50157 total results | |
2025-06-04 18:25:25,235 - INFO - Sleeping: 2.889962 seconds | |
2025-06-04 18:25:28,131 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=100&max_results=100 | |
2025-06-04 18:25:28,772 - INFO - Sleeping: 2.883966 seconds | |
2025-06-04 18:25:31,668 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=200&max_results=100 | |
2025-06-04 18:25:32,299 - INFO - Sleeping: 2.881156 seconds | |
2025-06-04 18:25:35,183 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=300&max_results=100 | |
2025-06-04 18:25:35,812 - INFO - Sleeping: 2.888015 seconds | |
2025-06-04 18:25:38,715 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-04 18:25:39,391 - INFO - Sleeping: 2.860997 seconds | |
2025-06-04 18:25:42,256 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=500&max_results=100 | |
2025-06-04 18:25:42,901 - INFO - Sleeping: 2.886973 seconds | |
2025-06-04 18:25:45,801 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=600&max_results=100 | |
2025-06-04 18:25:46,480 - INFO - Sleeping: 2.868057 seconds | |
2025-06-04 18:25:49,355 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=700&max_results=100 | |
2025-06-04 18:25:50,059 - INFO - Sleeping: 2.887865 seconds | |
2025-06-04 18:25:52,955 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=800&max_results=100 | |
2025-06-04 18:25:53,682 - INFO - Sleeping: 2.890004 seconds | |
2025-06-04 18:25:56,587 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=900&max_results=100 | |
2025-06-04 18:25:57,321 - INFO - Sleeping: 2.890996 seconds | |
2025-06-04 18:26:00,218 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-04 18:26:00,957 - INFO - Sleeping: 2.881048 seconds | |
2025-06-04 18:26:03,843 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1100&max_results=100 | |
2025-06-04 18:26:04,558 - INFO - Sleeping: 2.882004 seconds | |
2025-06-04 18:26:07,445 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1200&max_results=100 | |
2025-06-04 18:26:08,186 - INFO - Sleeping: 2.884004 seconds | |
2025-06-04 18:26:11,080 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1300&max_results=100 | |
2025-06-04 18:26:11,791 - INFO - Sleeping: 2.879006 seconds | |
2025-06-04 18:26:14,672 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1400&max_results=100 | |
2025-06-04 18:26:15,360 - INFO - Sleeping: 2.884253 seconds | |
2025-06-04 18:26:18,258 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1500&max_results=100 | |
2025-06-04 18:26:19,074 - INFO - Sleeping: 2.780152 seconds | |
2025-06-04 18:26:21,865 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1600&max_results=100 | |
2025-06-04 18:26:22,592 - INFO - Sleeping: 2.889969 seconds | |
2025-06-04 18:26:25,486 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1700&max_results=100 | |
2025-06-04 18:26:26,248 - INFO - Sleeping: 2.885832 seconds | |
2025-06-04 18:26:29,137 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1800&max_results=100 | |
2025-06-04 18:26:29,847 - INFO - Sleeping: 2.888996 seconds | |
2025-06-04 18:26:32,736 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1900&max_results=100 | |
2025-06-04 18:26:33,418 - INFO - Sleeping: 2.891001 seconds | |
2025-06-04 18:26:36,314 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2000&max_results=100 | |
2025-06-04 18:26:37,020 - INFO - Sleeping: 2.894964 seconds | |
2025-06-04 18:26:39,917 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2100&max_results=100 | |
2025-06-04 18:26:40,672 - INFO - Sleeping: 2.873007 seconds | |
2025-06-04 18:26:43,560 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2200&max_results=100 | |
2025-06-04 18:26:44,380 - INFO - Sleeping: 2.876039 seconds | |
2025-06-04 18:26:47,263 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2300&max_results=100 | |
2025-06-04 18:26:47,456 - INFO - Sleeping: 2.998000 seconds | |
2025-06-04 18:26:50,465 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2300&max_results=100 | |
2025-06-04 18:26:51,213 - INFO - Sleeping: 2.894004 seconds | |
2025-06-04 18:26:54,109 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2400&max_results=100 | |
2025-06-04 18:26:54,865 - INFO - Sleeping: 2.891856 seconds | |
2025-06-04 18:26:57,773 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2500&max_results=100 | |
2025-06-04 18:26:58,504 - INFO - Sleeping: 2.887003 seconds | |
2025-06-04 18:27:01,399 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2600&max_results=100 | |
2025-06-04 18:27:02,172 - INFO - Sleeping: 2.877958 seconds | |
2025-06-04 18:27:05,058 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2700&max_results=100 | |
2025-06-04 18:27:05,833 - INFO - Sleeping: 2.887216 seconds | |
2025-06-04 18:27:08,731 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2800&max_results=100 | |
2025-06-04 18:27:09,500 - INFO - Sleeping: 2.891250 seconds | |
2025-06-04 18:27:12,403 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2900&max_results=100 | |
2025-06-04 18:27:13,237 - INFO - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100 | |
2025-06-04 18:27:13,857 - INFO - Got first page: 100 of 100102 total results | |
2025-06-04 18:27:13,862 - INFO - Sleeping: 2.884006 seconds | |
2025-06-04 18:27:16,763 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=100&max_results=100 | |
2025-06-04 18:27:17,329 - INFO - Sleeping: 2.881001 seconds | |
2025-06-04 18:27:20,218 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=200&max_results=100 | |
2025-06-04 18:27:20,850 - INFO - Sleeping: 2.879202 seconds | |
2025-06-04 18:27:23,745 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=300&max_results=100 | |
2025-06-04 18:27:24,411 - INFO - Sleeping: 2.880967 seconds | |
2025-06-04 18:27:27,306 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-04 18:27:28,029 - INFO - Sleeping: 2.874096 seconds | |
2025-06-04 18:27:30,915 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=500&max_results=100 | |
2025-06-04 18:27:31,631 - INFO - Sleeping: 2.880004 seconds | |
2025-06-04 18:27:34,520 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=600&max_results=100 | |
2025-06-04 18:27:35,256 - INFO - Sleeping: 2.877006 seconds | |
2025-06-04 18:27:38,139 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=700&max_results=100 | |
2025-06-04 18:27:38,845 - INFO - Sleeping: 2.859003 seconds | |
2025-06-04 18:27:41,709 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=800&max_results=100 | |
2025-06-04 18:27:42,463 - INFO - Sleeping: 2.882012 seconds | |
2025-06-04 18:27:45,354 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=900&max_results=100 | |
2025-06-04 18:27:46,168 - INFO - Sleeping: 2.875090 seconds | |
2025-06-04 18:27:49,045 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-04 18:27:49,236 - INFO - Sleeping: 2.999005 seconds | |
2025-06-04 18:27:52,243 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-04 18:27:53,064 - INFO - Sleeping: 2.851008 seconds | |
2025-06-04 18:27:55,930 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1100&max_results=100 | |
2025-06-04 18:27:56,793 - INFO - Sleeping: 2.871210 seconds | |
2025-06-04 18:27:59,667 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1200&max_results=100 | |
2025-06-04 18:28:00,465 - INFO - Sleeping: 2.873985 seconds | |
2025-06-04 18:28:03,352 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1300&max_results=100 | |
2025-06-04 18:28:04,217 - INFO - Sleeping: 2.871961 seconds | |
2025-06-04 18:28:07,096 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1400&max_results=100 | |
2025-06-04 18:28:07,952 - INFO - Sleeping: 2.883991 seconds | |
2025-06-04 18:28:10,842 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1500&max_results=100 | |
2025-06-04 18:28:11,902 - INFO - Sleeping: 2.741968 seconds | |
2025-06-04 18:28:14,647 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1600&max_results=100 | |
2025-06-04 18:28:15,512 - INFO - Sleeping: 2.873990 seconds | |
2025-06-04 18:28:18,388 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1700&max_results=100 | |
2025-06-04 18:28:19,247 - INFO - Sleeping: 2.868484 seconds | |
2025-06-04 18:28:22,116 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1800&max_results=100 | |
2025-06-04 18:28:23,027 - INFO - Sleeping: 2.872947 seconds | |
2025-06-04 18:28:25,914 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1900&max_results=100 | |
2025-06-04 18:28:26,749 - INFO - Sleeping: 2.873965 seconds | |
2025-06-04 18:28:29,638 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2000&max_results=100 | |
2025-06-04 18:28:29,866 - INFO - Sleeping: 2.998144 seconds | |
2025-06-04 18:28:32,868 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2000&max_results=100 | |
2025-06-04 18:28:33,105 - INFO - Sleeping: 2.997999 seconds | |
2025-06-04 18:28:36,103 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2000&max_results=100 | |
2025-06-04 18:28:37,040 - INFO - Sleeping: 2.873968 seconds | |
2025-06-04 18:28:39,922 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2100&max_results=100 | |
2025-06-04 18:28:40,804 - INFO - Sleeping: 2.871206 seconds | |
2025-06-04 18:28:43,688 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2200&max_results=100 | |
2025-06-04 18:28:44,512 - INFO - Sleeping: 2.870001 seconds | |
2025-06-04 18:28:47,384 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2300&max_results=100 | |
2025-06-04 18:28:48,249 - INFO - Sleeping: 2.855996 seconds | |
2025-06-04 18:28:51,117 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2400&max_results=100 | |
2025-06-04 18:28:52,000 - INFO - Sleeping: 2.849823 seconds | |
2025-06-04 18:28:54,864 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2500&max_results=100 | |
2025-06-04 18:28:55,109 - INFO - Sleeping: 2.998202 seconds | |
2025-06-04 18:28:58,119 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2500&max_results=100 | |
2025-06-04 18:28:58,940 - INFO - Sleeping: 2.869240 seconds | |
2025-06-04 18:29:01,816 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2600&max_results=100 | |
2025-06-04 18:29:02,833 - INFO - Sleeping: 2.808999 seconds | |
2025-06-04 18:29:05,652 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2700&max_results=100 | |
2025-06-04 18:29:06,497 - INFO - Sleeping: 2.876108 seconds | |
2025-06-04 18:29:09,375 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2800&max_results=100 | |
2025-06-04 18:29:10,250 - INFO - Sleeping: 2.875218 seconds | |
2025-06-04 18:29:13,134 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2900&max_results=100 | |
2025-06-04 18:29:14,129 - INFO - Saved checkpoint to scientific_corpus_data\arxiv_papers.jsonl | |
2025-06-04 18:29:14,129 - INFO - Collected 5989 arXiv papers in 277.30s | |
2025-06-04 18:29:14,142 - INFO - Added 5989 papers from arXiv | |
2025-06-04 18:29:14,142 - INFO - Fetching PubMed papers... | |
2025-06-04 18:29:14,142 - INFO - Starting PubMed paper collection... | |
2025-06-04 18:29:31,294 - INFO - Saved checkpoint to scientific_corpus_data\pubmed_papers.jsonl | |
2025-06-04 18:29:31,294 - INFO - Collected 2668 PubMed papers in 17.15s | |
2025-06-04 18:29:31,294 - INFO - Processing 2668 biology papers... | |
2025-06-04 18:29:31,658 - INFO - Processed 2600/2668 biology papers | |
2025-06-04 18:29:31,658 - INFO - Added 2600 papers from PubMed | |
2025-06-04 18:29:31,658 - INFO - Fetching FineWeb-Edu papers... | |
2025-06-04 18:29:31,658 - INFO - Starting FineWeb-Edu collection... | |
2025-06-04 18:29:50,911 - INFO - Collected 10000 FineWeb samples | |
2025-06-04 18:29:54,120 - INFO - Collected 20000 FineWeb samples | |
2025-06-04 18:29:57,244 - INFO - Collected 30000 FineWeb samples | |
2025-06-04 18:29:57,246 - INFO - Processing 30000 FineWeb samples | |
2025-06-04 18:30:12,628 - INFO - Saved checkpoint to scientific_corpus_data\fineweb_edu.jsonl | |
2025-06-04 18:30:12,628 - INFO - Collected 29616 FineWeb-Edu papers in 40.97s | |
2025-06-04 18:30:12,645 - INFO - Processing 10000 education papers... | |
2025-06-04 18:30:24,154 - INFO - Processed 51990/10000 education papers | |
2025-06-04 18:30:24,155 - INFO - Processing 10000 education papers... | |
2025-06-04 18:30:33,041 - INFO - Processed 53216/10000 education papers | |
2025-06-04 18:30:33,043 - INFO - Processing 9616 education papers... | |
2025-06-04 18:30:42,148 - INFO - Processed 54196/9616 education papers | |
2025-06-04 18:30:42,177 - INFO - Added 159402 papers from FineWeb-Edu | |
2025-06-04 18:30:42,178 - INFO - Total papers collected: 167991 | |
2025-06-04 18:30:42,178 - INFO - Ranking and deduplicating papers... | |
2025-06-04 18:30:42,734 - INFO - Final corpus size: 0 papers | |
2025-06-04 18:30:42,734 - ERROR - Final corpus is empty. No data to process or save. | |
2025-06-04 19:40:24,209 - WARNING - Using default email for Entrez. Set ENTREZ_EMAIL environment variable. | |
2025-06-04 19:40:24,209 - INFO - Starting scientific corpus build... | |
2025-06-04 19:40:24,209 - INFO - Fetching arXiv papers... | |
2025-06-04 19:40:24,209 - INFO - Starting arXiv paper collection... | |
2025-06-04 19:40:24,211 - INFO - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100 | |
2025-06-04 19:40:26,342 - INFO - Got first page: 100 of 1234847 total results | |
2025-06-04 19:40:26,373 - INFO - Sleeping: 2.161671 seconds | |
2025-06-04 19:40:28,551 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=100&max_results=100 | |
2025-06-04 19:40:30,750 - INFO - Sleeping: 2.517183 seconds | |
2025-06-04 19:40:33,288 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=200&max_results=100 | |
2025-06-04 19:40:34,964 - INFO - Sleeping: 2.658016 seconds | |
2025-06-04 19:40:37,632 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=300&max_results=100 | |
2025-06-04 19:40:38,385 - INFO - Sleeping: 2.995441 seconds | |
2025-06-04 19:40:41,381 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=300&max_results=100 | |
2025-06-04 19:40:42,815 - INFO - Sleeping: 2.667007 seconds | |
2025-06-04 19:40:45,485 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-04 19:40:45,829 - INFO - Sleeping: 2.996727 seconds | |
2025-06-04 19:40:48,842 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-04 19:40:50,616 - INFO - Sleeping: 2.752690 seconds | |
2025-06-04 19:40:53,384 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=500&max_results=100 | |
2025-06-04 19:40:53,901 - INFO - Sleeping: 2.997476 seconds | |
2025-06-04 19:40:56,914 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=500&max_results=100 | |
2025-06-04 19:40:58,673 - INFO - Sleeping: 2.829839 seconds | |
2025-06-04 19:41:01,507 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=600&max_results=100 | |
2025-06-04 19:41:03,501 - INFO - Sleeping: 2.792955 seconds | |
2025-06-04 19:41:06,309 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=700&max_results=100 | |
2025-06-04 19:41:08,282 - INFO - Sleeping: 2.758867 seconds | |
2025-06-04 19:41:11,044 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=800&max_results=100 | |
2025-06-04 19:41:13,260 - INFO - Sleeping: 2.801440 seconds | |
2025-06-04 19:41:16,072 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=900&max_results=100 | |
2025-06-04 19:41:18,313 - INFO - Sleeping: 2.809029 seconds | |
2025-06-04 19:41:21,125 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-04 19:41:22,331 - INFO - Sleeping: 2.998355 seconds | |
2025-06-04 19:41:25,333 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-04 19:41:26,517 - INFO - Sleeping: 2.998692 seconds | |
2025-06-04 19:41:29,521 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-04 19:41:30,661 - INFO - Sleeping: 2.996485 seconds | |
2025-06-04 19:41:33,670 - INFO - Requesting page (first: False, try: 3): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-04 19:41:34,027 - WARNING - Empty page returned for query 'cat:physics* OR cat:astro-ph* OR cat:cond-mat* OR cat:hep-th OR cat:quant-ph OR cat:math-ph': Page of results was unexpectedly empty (https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100) | |
2025-06-04 19:41:34,030 - INFO - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100 | |
2025-06-04 19:41:39,322 - INFO - Got first page: 100 of 50157 total results | |
2025-06-04 19:41:39,329 - INFO - Sleeping: 2.802308 seconds | |
2025-06-04 19:41:42,134 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=100&max_results=100 | |
2025-06-04 19:41:42,922 - INFO - Sleeping: 2.757453 seconds | |
2025-06-04 19:41:45,693 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=200&max_results=100 | |
2025-06-04 19:41:46,544 - INFO - Sleeping: 2.781045 seconds | |
2025-06-04 19:41:49,328 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=300&max_results=100 | |
2025-06-04 19:41:50,230 - INFO - Sleeping: 2.646360 seconds | |
2025-06-04 19:41:52,884 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-04 19:41:53,672 - INFO - Sleeping: 2.735281 seconds | |
2025-06-04 19:41:56,414 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=500&max_results=100 | |
2025-06-04 19:41:57,461 - INFO - Sleeping: 2.597322 seconds | |
2025-06-04 19:42:00,064 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=600&max_results=100 | |
2025-06-04 19:42:01,119 - INFO - Sleeping: 2.659238 seconds | |
2025-06-04 19:42:03,787 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=700&max_results=100 | |
2025-06-04 19:42:04,817 - INFO - Sleeping: 2.744527 seconds | |
2025-06-04 19:42:07,605 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=800&max_results=100 | |
2025-06-04 19:42:08,580 - INFO - Sleeping: 2.746655 seconds | |
2025-06-04 19:42:11,333 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=900&max_results=100 | |
2025-06-04 19:42:12,173 - INFO - Sleeping: 2.807134 seconds | |
2025-06-04 19:42:14,984 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-04 19:42:15,893 - INFO - Sleeping: 2.773750 seconds | |
2025-06-04 19:42:18,668 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1100&max_results=100 | |
2025-06-04 19:42:19,719 - INFO - Sleeping: 2.835739 seconds | |
2025-06-04 19:42:22,557 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1200&max_results=100 | |
2025-06-04 19:42:23,649 - INFO - Sleeping: 2.690070 seconds | |
2025-06-04 19:42:26,343 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1300&max_results=100 | |
2025-06-04 19:42:27,248 - INFO - Sleeping: 2.788749 seconds | |
2025-06-04 19:42:30,039 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1400&max_results=100 | |
2025-06-04 19:42:30,215 - INFO - Sleeping: 2.998158 seconds | |
2025-06-04 19:42:33,218 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1400&max_results=100 | |
2025-06-04 19:42:33,369 - INFO - Sleeping: 3.000000 seconds | |
2025-06-04 19:42:36,370 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1400&max_results=100 | |
2025-06-04 19:42:36,585 - INFO - Sleeping: 2.998000 seconds | |
2025-06-04 19:42:39,591 - INFO - Requesting page (first: False, try: 3): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1400&max_results=100 | |
2025-06-04 19:42:40,451 - INFO - Sleeping: 2.824231 seconds | |
2025-06-04 19:42:43,277 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1500&max_results=100 | |
2025-06-04 19:42:44,140 - INFO - Sleeping: 2.837213 seconds | |
2025-06-04 19:42:46,981 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1600&max_results=100 | |
2025-06-04 19:42:47,819 - INFO - Sleeping: 2.845726 seconds | |
2025-06-04 19:42:50,668 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1700&max_results=100 | |
2025-06-04 19:42:50,859 - INFO - Sleeping: 2.998456 seconds | |
2025-06-04 19:42:53,871 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1700&max_results=100 | |
2025-06-04 19:42:54,097 - INFO - Sleeping: 2.996971 seconds | |
2025-06-04 19:42:57,107 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1700&max_results=100 | |
2025-06-04 19:42:57,364 - INFO - Sleeping: 2.994991 seconds | |
2025-06-04 19:43:00,360 - INFO - Requesting page (first: False, try: 3): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1700&max_results=100 | |
2025-06-04 19:43:00,569 - WARNING - Empty page returned for query 'cat:q-bio*': Page of results was unexpectedly empty (https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1700&max_results=100) | |
2025-06-04 19:43:00,570 - INFO - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100 | |
2025-06-04 19:43:08,739 - INFO - Got first page: 100 of 100102 total results | |
2025-06-04 19:43:08,749 - INFO - Sleeping: 2.824554 seconds | |
2025-06-04 19:43:11,576 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=100&max_results=100 | |
2025-06-04 19:43:12,432 - INFO - Sleeping: 2.796477 seconds | |
2025-06-04 19:43:15,242 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=200&max_results=100 | |
2025-06-04 19:43:16,019 - INFO - Sleeping: 2.815200 seconds | |
2025-06-04 19:43:18,971 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=300&max_results=100 | |
2025-06-04 19:43:19,932 - INFO - Sleeping: 2.792352 seconds | |
2025-06-04 19:43:22,732 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-04 19:43:23,643 - INFO - Sleeping: 2.809260 seconds | |
2025-06-04 19:43:26,465 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=500&max_results=100 | |
2025-06-04 19:43:27,409 - INFO - Sleeping: 2.813348 seconds | |
2025-06-04 19:43:30,225 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=600&max_results=100 | |
2025-06-04 19:43:31,041 - INFO - Sleeping: 2.815501 seconds | |
2025-06-04 19:43:33,861 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=700&max_results=100 | |
2025-06-04 19:43:34,796 - INFO - Sleeping: 2.694598 seconds | |
2025-06-04 19:43:37,497 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=800&max_results=100 | |
2025-06-04 19:43:38,380 - INFO - Sleeping: 2.796724 seconds | |
2025-06-04 19:43:41,179 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=900&max_results=100 | |
2025-06-04 19:43:42,204 - INFO - Sleeping: 2.802707 seconds | |
2025-06-04 19:43:45,021 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-04 19:43:45,477 - INFO - Sleeping: 2.996323 seconds | |
2025-06-04 19:43:48,475 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-04 19:43:49,320 - INFO - Sleeping: 2.811059 seconds | |
2025-06-04 19:43:52,145 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1100&max_results=100 | |
2025-06-04 19:43:53,122 - INFO - Sleeping: 2.692243 seconds | |
2025-06-04 19:43:55,823 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1200&max_results=100 | |
2025-06-04 19:43:56,697 - INFO - Sleeping: 2.800169 seconds | |
2025-06-04 19:43:59,504 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1300&max_results=100 | |
2025-06-04 19:43:59,686 - INFO - Sleeping: 2.998983 seconds | |
2025-06-04 19:44:02,690 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1300&max_results=100 | |
2025-06-04 19:44:02,946 - INFO - Sleeping: 2.997965 seconds | |
2025-06-04 19:44:05,949 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1300&max_results=100 | |
2025-06-04 19:44:07,469 - INFO - Sleeping: 2.710882 seconds | |
2025-06-04 19:44:10,181 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1400&max_results=100 | |
2025-06-04 19:44:11,092 - INFO - Sleeping: 2.791871 seconds | |
2025-06-04 19:44:13,886 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1500&max_results=100 | |
2025-06-04 19:44:14,850 - INFO - Sleeping: 2.789295 seconds | |
2025-06-04 19:44:17,640 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1600&max_results=100 | |
2025-06-04 19:44:18,553 - INFO - Sleeping: 2.790857 seconds | |
2025-06-04 19:44:21,359 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1700&max_results=100 | |
2025-06-04 19:44:21,590 - INFO - Sleeping: 2.997492 seconds | |
2025-06-04 19:44:24,590 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1700&max_results=100 | |
2025-06-04 19:44:25,554 - INFO - Sleeping: 2.795911 seconds | |
2025-06-04 19:44:28,350 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1800&max_results=100 | |
2025-06-04 19:44:28,529 - INFO - Sleeping: 2.997998 seconds | |
2025-06-04 19:44:31,528 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1800&max_results=100 | |
2025-06-04 19:44:31,809 - INFO - Sleeping: 2.997435 seconds | |
2025-06-04 19:44:34,816 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1800&max_results=100 | |
2025-06-04 19:44:34,999 - INFO - Sleeping: 2.998468 seconds | |
2025-06-04 19:44:38,010 - INFO - Requesting page (first: False, try: 3): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1800&max_results=100 | |
2025-06-04 19:44:38,229 - WARNING - Empty page returned for query 'cat:cond-mat.mtrl-sci OR cat:materials*': Page of results was unexpectedly empty (https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1800&max_results=100) | |
2025-06-04 19:44:38,231 - INFO - Saved checkpoint to scientific_corpus_data\arxiv_papers.jsonl | |
2025-06-04 19:44:38,231 - INFO - Collected 0 arXiv papers in 254.02s | |
2025-06-04 19:44:38,232 - INFO - Added 0 papers from arXiv | |
2025-06-04 19:44:38,232 - INFO - Fetching PubMed papers... | |
2025-06-04 19:44:38,232 - INFO - Starting PubMed paper collection... | |
2025-06-04 19:45:10,779 - INFO - Saved checkpoint to scientific_corpus_data\pubmed_papers.jsonl | |
2025-06-04 19:45:10,779 - INFO - Collected 2669 PubMed papers in 32.55s | |
2025-06-04 19:45:10,780 - INFO - Processing 2669 biology papers... | |
2025-06-04 19:45:12,211 - INFO - Processed 2602/2669 biology papers | |
2025-06-04 19:45:12,212 - INFO - Added 2602 papers from PubMed | |
2025-06-04 19:45:12,212 - INFO - Fetching FineWeb-Edu papers... | |
2025-06-04 19:45:12,213 - INFO - Starting FineWeb-Edu collection... | |
2025-06-04 19:45:51,730 - INFO - Collected 10000 FineWeb samples | |
2025-06-04 19:46:04,083 - INFO - Collected 20000 FineWeb samples | |
2025-06-04 19:46:17,655 - INFO - Collected 30000 FineWeb samples | |
2025-06-04 19:46:17,657 - INFO - Processing 30000 FineWeb samples | |
2025-06-04 19:46:43,962 - INFO - Saved checkpoint to scientific_corpus_data\fineweb_edu.jsonl | |
2025-06-04 19:46:43,962 - INFO - Collected 29616 FineWeb-Edu papers in 91.75s | |
2025-06-04 19:46:43,985 - INFO - Processing 10000 education papers... | |
2025-06-04 19:47:00,165 - INFO - Processed 51990/10000 education papers | |
2025-06-04 19:47:00,168 - INFO - Processing 10000 education papers... | |
2025-06-04 19:47:16,355 - INFO - Processed 53216/10000 education papers | |
2025-06-04 19:47:16,358 - INFO - Processing 9616 education papers... | |
2025-06-04 19:47:35,545 - INFO - Processed 54196/9616 education papers | |
2025-06-04 19:47:35,591 - INFO - Added 159402 papers from FineWeb-Edu | |
2025-06-04 19:47:35,591 - INFO - Total papers collected: 162004 | |
2025-06-04 19:47:35,591 - INFO - Ranking and deduplicating papers... | |
2025-06-04 19:47:36,117 - INFO - Final corpus size: 0 papers | |
2025-06-04 19:47:36,117 - ERROR - Final corpus is empty after ranking. Using unranked papers as fallback. | |
2025-06-04 19:47:41,304 - INFO - Saved checkpoint to scientific_corpus_data\ranked_papers.jsonl | |
2025-06-04 19:54:56,383 - INFO - Processing final dataset in batches... | |
2025-06-04 19:55:01,021 - INFO - Scientific corpus successfully built: scientific_corpus_325M.jsonl | |
2025-06-05 14:31:14,760 - WARNING - Using default email for Entrez. Set ENTREZ_EMAIL environment variable. | |
2025-06-05 14:31:14,760 - INFO - Starting scientific corpus build... | |
2025-06-05 14:31:14,760 - INFO - Fetching arXiv papers... | |
2025-06-05 14:31:14,760 - INFO - Starting arXiv paper collection... | |
2025-06-05 14:31:14,760 - INFO - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100 | |
2025-06-05 14:31:15,659 - INFO - Got first page: 100 of 1235159 total results | |
2025-06-05 14:31:15,664 - INFO - Sleeping: 2.876011 seconds | |
2025-06-05 14:31:18,548 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=100&max_results=100 | |
2025-06-05 14:31:20,356 - INFO - Sleeping: 2.868127 seconds | |
2025-06-05 14:31:23,228 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=200&max_results=100 | |
2025-06-05 14:31:25,450 - INFO - Sleeping: 2.878783 seconds | |
2025-06-05 14:31:28,329 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=300&max_results=100 | |
2025-06-05 14:31:29,045 - INFO - Sleeping: 2.997967 seconds | |
2025-06-05 14:31:32,053 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=300&max_results=100 | |
2025-06-05 14:31:33,480 - INFO - Sleeping: 2.833944 seconds | |
2025-06-05 14:31:36,330 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-05 14:31:37,006 - INFO - Sleeping: 2.998007 seconds | |
2025-06-05 14:31:40,013 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-05 14:31:40,368 - INFO - Sleeping: 2.998942 seconds | |
2025-06-05 14:31:43,379 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-05 14:31:45,204 - INFO - Sleeping: 2.890074 seconds | |
2025-06-05 14:31:48,110 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=500&max_results=100 | |
2025-06-05 14:31:49,802 - INFO - Sleeping: 2.886966 seconds | |
2025-06-05 14:31:52,694 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=600&max_results=100 | |
2025-06-05 14:31:55,124 - INFO - Sleeping: 2.854035 seconds | |
2025-06-05 14:31:57,980 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=700&max_results=100 | |
2025-06-05 14:31:58,299 - INFO - Sleeping: 2.999009 seconds | |
2025-06-05 14:32:01,312 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=700&max_results=100 | |
2025-06-05 14:32:02,917 - INFO - Sleeping: 2.886988 seconds | |
2025-06-05 14:32:05,810 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=800&max_results=100 | |
2025-06-05 14:32:08,174 - INFO - Sleeping: 2.890209 seconds | |
2025-06-05 14:32:11,079 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=900&max_results=100 | |
2025-06-05 14:32:13,258 - INFO - Sleeping: 2.875957 seconds | |
2025-06-05 14:32:16,147 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-05 14:32:16,895 - INFO - Sleeping: 2.998006 seconds | |
2025-06-05 14:32:19,905 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-05 14:32:20,235 - INFO - Sleeping: 2.998238 seconds | |
2025-06-05 14:32:23,249 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-05 14:32:24,512 - INFO - Sleeping: 2.998224 seconds | |
2025-06-05 14:32:27,512 - INFO - Requesting page (first: False, try: 3): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-05 14:32:28,732 - WARNING - Empty page returned for query 'cat:physics* OR cat:astro-ph* OR cat:cond-mat* OR cat:hep-th OR cat:quant-ph OR cat:math-ph': Page of results was unexpectedly empty (https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100) | |
2025-06-05 14:32:28,735 - INFO - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100 | |
2025-06-05 14:32:29,451 - INFO - Got first page: 10 of 50184 total results | |
2025-06-05 14:32:29,452 - INFO - Sleeping: 2.988009 seconds | |
2025-06-05 14:32:32,454 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=10&max_results=100 | |
2025-06-05 14:32:33,222 - INFO - Sleeping: 2.893003 seconds | |
2025-06-05 14:32:36,125 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=110&max_results=100 | |
2025-06-05 14:32:36,969 - INFO - Sleeping: 2.885008 seconds | |
2025-06-05 14:32:39,865 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=210&max_results=100 | |
2025-06-05 14:32:40,707 - INFO - Sleeping: 2.879990 seconds | |
2025-06-05 14:32:43,595 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=310&max_results=100 | |
2025-06-05 14:32:44,516 - INFO - Sleeping: 2.886200 seconds | |
2025-06-05 14:32:47,417 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=410&max_results=100 | |
2025-06-05 14:32:48,208 - INFO - Sleeping: 2.887005 seconds | |
2025-06-05 14:32:51,106 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=510&max_results=100 | |
2025-06-05 14:32:52,117 - INFO - Sleeping: 2.891963 seconds | |
2025-06-05 14:32:55,020 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=610&max_results=100 | |
2025-06-05 14:32:55,863 - INFO - Sleeping: 2.871965 seconds | |
2025-06-05 14:32:58,742 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=710&max_results=100 | |
2025-06-05 14:32:59,608 - INFO - Sleeping: 2.884966 seconds | |
2025-06-05 14:33:02,500 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=810&max_results=100 | |
2025-06-05 14:33:03,438 - INFO - Sleeping: 2.889214 seconds | |
2025-06-05 14:33:06,334 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=910&max_results=100 | |
2025-06-05 14:33:07,267 - INFO - Sleeping: 2.887106 seconds | |
2025-06-05 14:33:10,170 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1010&max_results=100 | |
2025-06-05 14:33:11,077 - INFO - Sleeping: 2.880256 seconds | |
2025-06-05 14:33:13,963 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1110&max_results=100 | |
2025-06-05 14:33:14,885 - INFO - Sleeping: 2.784197 seconds | |
2025-06-05 14:33:17,675 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1210&max_results=100 | |
2025-06-05 14:33:18,901 - INFO - Sleeping: 2.884003 seconds | |
2025-06-05 14:33:21,792 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1310&max_results=100 | |
2025-06-05 14:33:22,873 - INFO - Sleeping: 2.874248 seconds | |
2025-06-05 14:33:25,757 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1410&max_results=100 | |
2025-06-05 14:33:26,569 - INFO - Sleeping: 2.888043 seconds | |
2025-06-05 14:33:29,463 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1510&max_results=100 | |
2025-06-05 14:33:30,837 - INFO - Sleeping: 2.877951 seconds | |
2025-06-05 14:33:33,719 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1610&max_results=100 | |
2025-06-05 14:33:34,624 - INFO - Sleeping: 2.888004 seconds | |
2025-06-05 14:33:37,523 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1710&max_results=100 | |
2025-06-05 14:33:38,673 - INFO - Sleeping: 2.886049 seconds | |
2025-06-05 14:33:41,573 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1810&max_results=100 | |
2025-06-05 14:33:42,448 - INFO - Sleeping: 2.885806 seconds | |
2025-06-05 14:33:45,346 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1910&max_results=100 | |
2025-06-05 14:33:46,188 - INFO - Sleeping: 2.888998 seconds | |
2025-06-05 14:33:49,091 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2010&max_results=100 | |
2025-06-05 14:33:50,236 - INFO - Sleeping: 2.894005 seconds | |
2025-06-05 14:33:53,144 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2110&max_results=100 | |
2025-06-05 14:33:54,219 - INFO - Sleeping: 2.875000 seconds | |
2025-06-05 14:33:57,103 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2210&max_results=100 | |
2025-06-05 14:33:57,991 - INFO - Sleeping: 2.889005 seconds | |
2025-06-05 14:34:00,892 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2310&max_results=100 | |
2025-06-05 14:34:02,061 - INFO - Sleeping: 2.893965 seconds | |
2025-06-05 14:34:04,966 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2410&max_results=100 | |
2025-06-05 14:34:05,876 - INFO - Sleeping: 2.885999 seconds | |
2025-06-05 14:34:08,763 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2510&max_results=100 | |
2025-06-05 14:34:10,242 - INFO - Sleeping: 2.888012 seconds | |
2025-06-05 14:34:13,143 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2610&max_results=100 | |
2025-06-05 14:34:14,077 - INFO - Sleeping: 2.885134 seconds | |
2025-06-05 14:34:16,967 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2710&max_results=100 | |
2025-06-05 14:34:17,888 - INFO - Sleeping: 2.886009 seconds | |
2025-06-05 14:34:20,778 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2810&max_results=100 | |
2025-06-05 14:34:21,850 - INFO - Sleeping: 2.888005 seconds | |
2025-06-05 14:34:24,752 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2910&max_results=100 | |
2025-06-05 14:34:25,782 - INFO - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100 | |
2025-06-05 14:34:27,506 - INFO - Got first page: 100 of 100127 total results | |
2025-06-05 14:34:27,512 - INFO - Sleeping: 2.877989 seconds | |
2025-06-05 14:34:30,393 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=100&max_results=100 | |
2025-06-05 14:34:31,028 - INFO - Sleeping: 2.877958 seconds | |
2025-06-05 14:34:33,916 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=200&max_results=100 | |
2025-06-05 14:34:34,090 - INFO - Sleeping: 2.999003 seconds | |
2025-06-05 14:34:37,091 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=200&max_results=100 | |
2025-06-05 14:34:37,737 - INFO - Sleeping: 2.880005 seconds | |
2025-06-05 14:34:40,629 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=300&max_results=100 | |
2025-06-05 14:34:41,292 - INFO - Sleeping: 2.879003 seconds | |
2025-06-05 14:34:44,182 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-05 14:34:44,930 - INFO - Sleeping: 2.869964 seconds | |
2025-06-05 14:34:47,810 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=500&max_results=100 | |
2025-06-05 14:34:48,535 - INFO - Sleeping: 2.874956 seconds | |
2025-06-05 14:34:51,410 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=600&max_results=100 | |
2025-06-05 14:34:52,141 - INFO - Sleeping: 2.877994 seconds | |
2025-06-05 14:34:55,024 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=700&max_results=100 | |
2025-06-05 14:34:56,085 - INFO - Sleeping: 2.864949 seconds | |
2025-06-05 14:34:58,964 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=800&max_results=100 | |
2025-06-05 14:34:59,873 - INFO - Sleeping: 2.877992 seconds | |
2025-06-05 14:35:02,759 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=900&max_results=100 | |
2025-06-05 14:35:03,686 - INFO - Sleeping: 2.874266 seconds | |
2025-06-05 14:35:06,571 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-05 14:35:07,501 - INFO - Sleeping: 2.871005 seconds | |
2025-06-05 14:35:10,377 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1100&max_results=100 | |
2025-06-05 14:35:11,209 - INFO - Sleeping: 2.864960 seconds | |
2025-06-05 14:35:14,083 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1200&max_results=100 | |
2025-06-05 14:35:15,404 - INFO - Sleeping: 2.746202 seconds | |
2025-06-05 14:35:18,151 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1300&max_results=100 | |
2025-06-05 14:35:19,199 - INFO - Sleeping: 2.868003 seconds | |
2025-06-05 14:35:22,070 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1400&max_results=100 | |
2025-06-05 14:35:22,910 - INFO - Sleeping: 2.884035 seconds | |
2025-06-05 14:35:25,804 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1500&max_results=100 | |
2025-06-05 14:35:26,034 - INFO - Sleeping: 2.998002 seconds | |
2025-06-05 14:35:29,037 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1500&max_results=100 | |
2025-06-05 14:35:29,873 - INFO - Sleeping: 2.867136 seconds | |
2025-06-05 14:35:32,746 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1600&max_results=100 | |
2025-06-05 14:35:33,667 - INFO - Sleeping: 2.872990 seconds | |
2025-06-05 14:35:36,540 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1700&max_results=100 | |
2025-06-05 14:35:37,841 - INFO - Sleeping: 2.874205 seconds | |
2025-06-05 14:35:40,721 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1800&max_results=100 | |
2025-06-05 14:35:40,927 - INFO - Sleeping: 2.997974 seconds | |
2025-06-05 14:35:43,929 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1800&max_results=100 | |
2025-06-05 14:35:44,848 - INFO - Sleeping: 2.865966 seconds | |
2025-06-05 14:35:47,715 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1900&max_results=100 | |
2025-06-05 14:35:48,784 - INFO - Sleeping: 2.876990 seconds | |
2025-06-05 14:35:51,662 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2000&max_results=100 | |
2025-06-05 14:35:52,749 - INFO - Sleeping: 2.870965 seconds | |
2025-06-05 14:35:55,633 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2100&max_results=100 | |
2025-06-05 14:35:56,794 - INFO - Sleeping: 2.810096 seconds | |
2025-06-05 14:35:59,605 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2200&max_results=100 | |
2025-06-05 14:36:00,463 - INFO - Sleeping: 2.871957 seconds | |
2025-06-05 14:36:03,337 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2300&max_results=100 | |
2025-06-05 14:36:04,366 - INFO - Sleeping: 2.870964 seconds | |
2025-06-05 14:36:07,251 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2400&max_results=100 | |
2025-06-05 14:36:08,018 - INFO - Sleeping: 2.866220 seconds | |
2025-06-05 14:36:10,894 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2500&max_results=100 | |
2025-06-05 14:36:11,693 - INFO - Sleeping: 2.876956 seconds | |
2025-06-05 14:36:14,579 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2600&max_results=100 | |
2025-06-05 14:36:15,428 - INFO - Sleeping: 2.876213 seconds | |
2025-06-05 14:36:18,322 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2700&max_results=100 | |
2025-06-05 14:36:19,214 - INFO - Sleeping: 2.870772 seconds | |
2025-06-05 14:36:22,099 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2800&max_results=100 | |
2025-06-05 14:36:22,988 - INFO - Sleeping: 2.872093 seconds | |
2025-06-05 14:36:25,869 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2900&max_results=100 | |
2025-06-05 14:36:26,927 - INFO - Saved checkpoint to scientific_corpus_data\arxiv_papers.jsonl | |
2025-06-05 14:36:26,928 - INFO - Collected 5989 arXiv papers in 312.17s | |
2025-06-05 14:36:27,067 - INFO - Saved checkpoint to scientific_corpus_data\arxiv_papers.jsonl | |
2025-06-05 14:36:27,067 - INFO - Added 5989 papers from arXiv | |
2025-06-05 14:36:27,067 - INFO - Fetching PubMed papers... | |
2025-06-05 14:36:27,068 - INFO - Starting PubMed paper collection... | |
2025-06-05 14:36:34,649 - ERROR - Network error fetching PubMed batch: HTTP Error 500: Internal Server Error | |
2025-06-05 14:36:34,649 - WARNING - Error in _fetch_pubmed_batch: HTTP Error 500: Internal Server Error. Retrying in 2.0s... | |
2025-06-05 14:36:44,445 - ERROR - Network error fetching PubMed batch: HTTP Error 500: Internal Server Error | |
2025-06-05 14:36:44,445 - WARNING - Error in _fetch_pubmed_batch: HTTP Error 500: Internal Server Error. Retrying in 2.0s... | |
2025-06-05 14:36:47,190 - ERROR - Network error fetching PubMed batch: HTTP Error 500: Internal Server Error | |
2025-06-05 14:36:47,190 - WARNING - Error in _fetch_pubmed_batch: HTTP Error 500: Internal Server Error. Retrying in 4.0s... | |
2025-06-05 14:37:00,932 - INFO - Saved checkpoint to scientific_corpus_data\pubmed_papers.jsonl | |
2025-06-05 14:37:00,932 - INFO - Collected 2669 PubMed papers in 33.86s | |
2025-06-05 14:37:00,932 - INFO - Processing 1000 biology papers... | |
2025-06-05 14:37:01,064 - INFO - Processed 946/1000 biology papers | |
2025-06-05 14:37:01,065 - INFO - Processing 1000 biology papers... | |
2025-06-05 14:37:01,206 - INFO - Processed 991/1000 biology papers | |
2025-06-05 14:37:01,206 - INFO - Processing 669 biology papers... | |
2025-06-05 14:37:01,318 - INFO - Processed 665/669 biology papers | |
2025-06-05 14:37:01,364 - INFO - Saved checkpoint to scientific_corpus_data\pubmed_papers.jsonl | |
2025-06-05 14:37:01,364 - INFO - Added 2602 papers from PubMed | |
2025-06-05 14:37:01,364 - INFO - Fetching FineWeb-Edu papers... | |
2025-06-05 14:37:01,364 - INFO - Starting FineWeb-Edu collection... | |
2025-06-05 14:37:35,332 - INFO - Collected 10000 FineWeb samples | |
2025-06-05 14:37:40,494 - INFO - Collected 20000 FineWeb samples | |
2025-06-05 14:37:44,059 - INFO - Collected 30000 FineWeb samples | |
2025-06-05 14:37:44,565 - INFO - Processing 30000 FineWeb samples | |
2025-06-05 14:37:58,136 - INFO - Saved checkpoint to scientific_corpus_data\fineweb_edu.jsonl | |
2025-06-05 14:37:58,136 - INFO - Collected 29616 FineWeb-Edu papers in 56.77s | |
2025-06-05 14:37:58,150 - INFO - Processing 1000 education papers... | |
2025-06-05 14:37:58,936 - INFO - Processed 5354/1000 education papers | |
2025-06-05 14:37:58,936 - INFO - Processing 1000 education papers... | |
2025-06-05 14:37:59,762 - INFO - Processed 5622/1000 education papers | |
2025-06-05 14:37:59,762 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:00,580 - INFO - Processed 4975/1000 education papers | |
2025-06-05 14:38:00,580 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:01,315 - INFO - Processed 5011/1000 education papers | |
2025-06-05 14:38:01,315 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:02,097 - INFO - Processed 5349/1000 education papers | |
2025-06-05 14:38:02,098 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:03,073 - INFO - Processed 5667/1000 education papers | |
2025-06-05 14:38:03,074 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:03,846 - INFO - Processed 5081/1000 education papers | |
2025-06-05 14:38:03,846 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:04,524 - INFO - Processed 4592/1000 education papers | |
2025-06-05 14:38:04,524 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:05,315 - INFO - Processed 5222/1000 education papers | |
2025-06-05 14:38:05,316 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:06,118 - INFO - Processed 5117/1000 education papers | |
2025-06-05 14:38:06,118 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:06,895 - INFO - Processed 5179/1000 education papers | |
2025-06-05 14:38:06,895 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:07,617 - INFO - Processed 4847/1000 education papers | |
2025-06-05 14:38:07,618 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:08,395 - INFO - Processed 5101/1000 education papers | |
2025-06-05 14:38:08,396 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:09,159 - INFO - Processed 5065/1000 education papers | |
2025-06-05 14:38:09,160 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:09,995 - INFO - Processed 5452/1000 education papers | |
2025-06-05 14:38:09,995 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:10,812 - INFO - Processed 5244/1000 education papers | |
2025-06-05 14:38:10,812 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:11,646 - INFO - Processed 5388/1000 education papers | |
2025-06-05 14:38:11,646 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:12,390 - INFO - Processed 4964/1000 education papers | |
2025-06-05 14:38:12,390 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:13,249 - INFO - Processed 5677/1000 education papers | |
2025-06-05 14:38:13,250 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:14,217 - INFO - Processed 6299/1000 education papers | |
2025-06-05 14:38:14,217 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:15,164 - INFO - Processed 6131/1000 education papers | |
2025-06-05 14:38:15,164 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:16,151 - INFO - Processed 6091/1000 education papers | |
2025-06-05 14:38:16,152 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:17,131 - INFO - Processed 6183/1000 education papers | |
2025-06-05 14:38:17,131 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:18,014 - INFO - Processed 5664/1000 education papers | |
2025-06-05 14:38:18,014 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:18,892 - INFO - Processed 5700/1000 education papers | |
2025-06-05 14:38:18,892 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:19,642 - INFO - Processed 4994/1000 education papers | |
2025-06-05 14:38:19,642 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:20,513 - INFO - Processed 5757/1000 education papers | |
2025-06-05 14:38:20,513 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:21,344 - INFO - Processed 5505/1000 education papers | |
2025-06-05 14:38:21,345 - INFO - Processing 1000 education papers... | |
2025-06-05 14:38:22,302 - INFO - Processed 5139/1000 education papers | |
2025-06-05 14:38:22,302 - INFO - Processing 616 education papers... | |
2025-06-05 14:38:22,758 - INFO - Processed 3032/616 education papers | |
2025-06-05 14:38:25,704 - INFO - Saved checkpoint to scientific_corpus_data\fineweb-edu_papers.jsonl | |
2025-06-05 14:38:25,706 - INFO - Added 159402 papers from FineWeb-Edu | |
2025-06-05 14:38:25,706 - INFO - Total papers collected: 167993 | |
2025-06-05 14:38:25,706 - INFO - Ranking and deduplicating papers... | |
2025-06-05 14:38:26,046 - INFO - Final corpus size: 0 papers | |
2025-06-05 14:38:26,046 - ERROR - Final corpus is empty after ranking. Using unranked papers as fallback. | |
2025-06-05 14:38:29,046 - INFO - Saved checkpoint to scientific_corpus_data\ranked_papers.jsonl | |
2025-06-05 14:41:59,912 - INFO - Processing final dataset in batches... | |
2025-06-05 14:42:03,716 - INFO - Scientific corpus successfully built: scientific_corpus_325M.jsonl | |
2025-06-05 14:42:05,256 - WARNING - Cloning https://huggingface.co/datasets/Allanatrix/Scientific_Research_Tokenized into local empty directory. | |
2025-06-05 14:54:06,133 - WARNING - remote: ------------------------------------------------------------------------- | |
remote: Your push was rejected because it contains files larger than 10 MiB. | |
remote: Please use https://git-lfs.github.com/ to store large files. | |
remote: See also: https://hf.co/docs/hub/repositories-getting-started#terminal | |
remote: | |
remote: Offending files: | |
remote: - scientific_corpus_325M.jsonl (ref: refs/heads/main) | |
remote: ------------------------------------------------------------------------- | |
To https://huggingface.co/datasets/Allanatrix/Scientific_Research_Tokenized | |
! [remote rejected] main -> main (pre-receive hook declined) | |
error: failed to push some refs to 'https://huggingface.co/datasets/Allanatrix/Scientific_Research_Tokenized' | |
2025-06-05 14:54:06,429 - ERROR - Error during Hugging Face upload: remote: ------------------------------------------------------------------------- | |
remote: Your push was rejected because it contains files larger than 10 MiB. | |
remote: Please use https://git-lfs.github.com/ to store large files. | |
remote: See also: https://hf.co/docs/hub/repositories-getting-started#terminal | |
remote: | |
remote: Offending files: | |
remote: - scientific_corpus_325M.jsonl (ref: refs/heads/main) | |
remote: ------------------------------------------------------------------------- | |
To https://huggingface.co/datasets/Allanatrix/Scientific_Research_Tokenized | |
! [remote rejected] main -> main (pre-receive hook declined) | |
error: failed to push some refs to 'https://huggingface.co/datasets/Allanatrix/Scientific_Research_Tokenized' | |
2025-06-05 15:12:37,911 - WARNING - Using default email for Entrez. Set ENTREZ_EMAIL environment variable. | |
2025-06-05 15:12:37,911 - INFO - Starting scientific corpus build... | |
2025-06-05 15:12:37,911 - INFO - Fetching arXiv papers... | |
2025-06-05 15:12:37,912 - INFO - Starting arXiv paper collection... | |
2025-06-05 15:12:37,913 - INFO - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100 | |
2025-06-05 15:12:39,074 - INFO - Got first page: 100 of 1235159 total results | |
2025-06-05 15:12:39,081 - INFO - Sleeping: 2.829702 seconds | |
2025-06-05 15:12:41,926 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=100&max_results=100 | |
2025-06-05 15:12:42,486 - INFO - Sleeping: 2.995433 seconds | |
2025-06-05 15:12:45,491 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=100&max_results=100 | |
2025-06-05 15:12:46,106 - INFO - Sleeping: 2.996252 seconds | |
2025-06-05 15:12:49,106 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=100&max_results=100 | |
2025-06-05 15:12:50,164 - INFO - Sleeping: 2.831608 seconds | |
2025-06-05 15:12:52,998 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=200&max_results=100 | |
2025-06-05 15:12:54,892 - INFO - Sleeping: 2.844823 seconds | |
2025-06-05 15:12:57,750 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=300&max_results=100 | |
2025-06-05 15:12:59,750 - INFO - Sleeping: 2.789286 seconds | |
2025-06-05 15:13:02,552 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-05 15:13:03,374 - INFO - Sleeping: 2.996416 seconds | |
2025-06-05 15:13:06,378 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-05 15:13:07,371 - INFO - Sleeping: 2.999055 seconds | |
2025-06-05 15:13:10,381 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-05 15:13:11,098 - INFO - Sleeping: 2.998991 seconds | |
2025-06-05 15:13:14,113 - INFO - Requesting page (first: False, try: 3): https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-05 15:13:15,089 - WARNING - Empty page returned for query 'cat:physics* OR cat:astro-ph* OR cat:cond-mat* OR cat:hep-th OR cat:quant-ph OR cat:math-ph': Page of results was unexpectedly empty (https://export.arxiv.org/api/query?search_query=cat%3Aphysics%2A+OR+cat%3Aastro-ph%2A+OR+cat%3Acond-mat%2A+OR+cat%3Ahep-th+OR+cat%3Aquant-ph+OR+cat%3Amath-ph&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100) | |
2025-06-05 15:13:15,092 - INFO - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100 | |
2025-06-05 15:13:15,903 - INFO - Got first page: 100 of 50184 total results | |
2025-06-05 15:13:15,908 - INFO - Sleeping: 2.868669 seconds | |
2025-06-05 15:13:18,789 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=100&max_results=100 | |
2025-06-05 15:13:19,544 - INFO - Sleeping: 2.856774 seconds | |
2025-06-05 15:13:22,403 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=200&max_results=100 | |
2025-06-05 15:13:23,231 - INFO - Sleeping: 2.841385 seconds | |
2025-06-05 15:13:26,079 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=300&max_results=100 | |
2025-06-05 15:13:26,770 - INFO - Sleeping: 2.862628 seconds | |
2025-06-05 15:13:29,635 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-05 15:13:30,445 - INFO - Sleeping: 2.845824 seconds | |
2025-06-05 15:13:33,301 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=500&max_results=100 | |
2025-06-05 15:13:34,315 - INFO - Sleeping: 2.856134 seconds | |
2025-06-05 15:13:37,176 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=600&max_results=100 | |
2025-06-05 15:13:38,048 - INFO - Sleeping: 2.847935 seconds | |
2025-06-05 15:13:40,910 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=700&max_results=100 | |
2025-06-05 15:13:41,934 - INFO - Sleeping: 2.844453 seconds | |
2025-06-05 15:13:44,793 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=800&max_results=100 | |
2025-06-05 15:13:45,604 - INFO - Sleeping: 2.842624 seconds | |
2025-06-05 15:13:48,456 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=900&max_results=100 | |
2025-06-05 15:13:49,527 - INFO - Sleeping: 2.844088 seconds | |
2025-06-05 15:13:52,379 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-05 15:13:53,236 - INFO - Sleeping: 2.843688 seconds | |
2025-06-05 15:13:56,090 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1100&max_results=100 | |
2025-06-05 15:13:56,952 - INFO - Sleeping: 2.848608 seconds | |
2025-06-05 15:13:59,812 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1200&max_results=100 | |
2025-06-05 15:14:00,815 - INFO - Sleeping: 2.860489 seconds | |
2025-06-05 15:14:03,683 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1300&max_results=100 | |
2025-06-05 15:14:04,652 - INFO - Sleeping: 2.845746 seconds | |
2025-06-05 15:14:07,499 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1400&max_results=100 | |
2025-06-05 15:14:08,322 - INFO - Sleeping: 2.857928 seconds | |
2025-06-05 15:14:11,188 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1500&max_results=100 | |
2025-06-05 15:14:12,125 - INFO - Sleeping: 2.841879 seconds | |
2025-06-05 15:14:14,976 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1600&max_results=100 | |
2025-06-05 15:14:15,197 - INFO - Sleeping: 2.998921 seconds | |
2025-06-05 15:14:18,203 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1600&max_results=100 | |
2025-06-05 15:14:19,256 - INFO - Sleeping: 2.850885 seconds | |
2025-06-05 15:14:22,108 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1700&max_results=100 | |
2025-06-05 15:14:23,871 - INFO - Sleeping: 2.854963 seconds | |
2025-06-05 15:14:26,732 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1800&max_results=100 | |
2025-06-05 15:14:26,932 - INFO - Sleeping: 2.998011 seconds | |
2025-06-05 15:14:29,937 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1800&max_results=100 | |
2025-06-05 15:14:31,167 - INFO - Sleeping: 2.733643 seconds | |
2025-06-05 15:14:33,916 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1900&max_results=100 | |
2025-06-05 15:14:34,806 - INFO - Sleeping: 2.860244 seconds | |
2025-06-05 15:14:37,682 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2000&max_results=100 | |
2025-06-05 15:14:38,740 - INFO - Sleeping: 2.871787 seconds | |
2025-06-05 15:14:41,624 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2100&max_results=100 | |
2025-06-05 15:14:42,560 - INFO - Sleeping: 2.846611 seconds | |
2025-06-05 15:14:45,413 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2200&max_results=100 | |
2025-06-05 15:14:46,227 - INFO - Sleeping: 2.864854 seconds | |
2025-06-05 15:14:49,104 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2300&max_results=100 | |
2025-06-05 15:14:50,683 - INFO - Sleeping: 2.868842 seconds | |
2025-06-05 15:14:53,558 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2400&max_results=100 | |
2025-06-05 15:14:54,543 - INFO - Sleeping: 2.798651 seconds | |
2025-06-05 15:14:57,356 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2500&max_results=100 | |
2025-06-05 15:14:58,360 - INFO - Sleeping: 2.843485 seconds | |
2025-06-05 15:15:01,209 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2600&max_results=100 | |
2025-06-05 15:15:02,488 - INFO - Sleeping: 2.862277 seconds | |
2025-06-05 15:15:05,352 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2700&max_results=100 | |
2025-06-05 15:15:05,589 - INFO - Sleeping: 2.998042 seconds | |
2025-06-05 15:15:08,598 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2700&max_results=100 | |
2025-06-05 15:15:09,544 - INFO - Sleeping: 2.866835 seconds | |
2025-06-05 15:15:12,418 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2800&max_results=100 | |
2025-06-05 15:15:13,356 - INFO - Sleeping: 2.850690 seconds | |
2025-06-05 15:15:16,222 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Aq-bio%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2900&max_results=100 | |
2025-06-05 15:15:17,265 - INFO - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100 | |
2025-06-05 15:15:17,974 - INFO - Got first page: 100 of 100127 total results | |
2025-06-05 15:15:17,980 - INFO - Sleeping: 2.852069 seconds | |
2025-06-05 15:15:20,842 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=100&max_results=100 | |
2025-06-05 15:15:21,580 - INFO - Sleeping: 2.836391 seconds | |
2025-06-05 15:15:24,429 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=200&max_results=100 | |
2025-06-05 15:15:25,288 - INFO - Sleeping: 2.855620 seconds | |
2025-06-05 15:15:28,159 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=300&max_results=100 | |
2025-06-05 15:15:29,142 - INFO - Sleeping: 2.844767 seconds | |
2025-06-05 15:15:31,995 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=400&max_results=100 | |
2025-06-05 15:15:33,348 - INFO - Sleeping: 2.842629 seconds | |
2025-06-05 15:15:36,197 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=500&max_results=100 | |
2025-06-05 15:15:37,023 - INFO - Sleeping: 2.839913 seconds | |
2025-06-05 15:15:39,868 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=600&max_results=100 | |
2025-06-05 15:15:40,791 - INFO - Sleeping: 2.847369 seconds | |
2025-06-05 15:15:43,640 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=700&max_results=100 | |
2025-06-05 15:15:44,519 - INFO - Sleeping: 2.817944 seconds | |
2025-06-05 15:15:47,349 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=800&max_results=100 | |
2025-06-05 15:15:47,609 - INFO - Sleeping: 2.997964 seconds | |
2025-06-05 15:15:50,611 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=800&max_results=100 | |
2025-06-05 15:15:51,573 - INFO - Sleeping: 2.843646 seconds | |
2025-06-05 15:15:54,433 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=900&max_results=100 | |
2025-06-05 15:15:55,437 - INFO - Sleeping: 2.840713 seconds | |
2025-06-05 15:15:58,291 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1000&max_results=100 | |
2025-06-05 15:15:59,414 - INFO - Sleeping: 2.822105 seconds | |
2025-06-05 15:16:02,240 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1100&max_results=100 | |
2025-06-05 15:16:03,318 - INFO - Sleeping: 2.824031 seconds | |
2025-06-05 15:16:06,150 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1200&max_results=100 | |
2025-06-05 15:16:07,284 - INFO - Sleeping: 2.843175 seconds | |
2025-06-05 15:16:10,132 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1300&max_results=100 | |
2025-06-05 15:16:11,072 - INFO - Sleeping: 2.837981 seconds | |
2025-06-05 15:16:13,919 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1400&max_results=100 | |
2025-06-05 15:16:14,844 - INFO - Sleeping: 2.858052 seconds | |
2025-06-05 15:16:17,716 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1500&max_results=100 | |
2025-06-05 15:16:18,634 - INFO - Sleeping: 2.842808 seconds | |
2025-06-05 15:16:21,480 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1600&max_results=100 | |
2025-06-05 15:16:22,616 - INFO - Sleeping: 2.772050 seconds | |
2025-06-05 15:16:25,390 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1700&max_results=100 | |
2025-06-05 15:16:26,461 - INFO - Sleeping: 2.684801 seconds | |
2025-06-05 15:16:29,153 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1800&max_results=100 | |
2025-06-05 15:16:30,247 - INFO - Sleeping: 2.834220 seconds | |
2025-06-05 15:16:33,085 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1900&max_results=100 | |
2025-06-05 15:16:33,345 - INFO - Sleeping: 2.998054 seconds | |
2025-06-05 15:16:36,358 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1900&max_results=100 | |
2025-06-05 15:16:36,778 - INFO - Sleeping: 2.998957 seconds | |
2025-06-05 15:16:39,781 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=1900&max_results=100 | |
2025-06-05 15:16:40,694 - INFO - Sleeping: 2.843340 seconds | |
2025-06-05 15:16:43,549 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2000&max_results=100 | |
2025-06-05 15:16:44,407 - INFO - Sleeping: 2.870744 seconds | |
2025-06-05 15:16:47,287 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2100&max_results=100 | |
2025-06-05 15:16:48,190 - INFO - Sleeping: 2.872180 seconds | |
2025-06-05 15:16:51,076 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2200&max_results=100 | |
2025-06-05 15:16:52,074 - INFO - Sleeping: 2.871416 seconds | |
2025-06-05 15:16:54,955 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2300&max_results=100 | |
2025-06-05 15:16:55,920 - INFO - Sleeping: 2.872292 seconds | |
2025-06-05 15:16:58,806 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2400&max_results=100 | |
2025-06-05 15:16:59,044 - INFO - Sleeping: 2.997999 seconds | |
2025-06-05 15:17:02,051 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2400&max_results=100 | |
2025-06-05 15:17:02,365 - INFO - Sleeping: 3.000000 seconds | |
2025-06-05 15:17:05,376 - INFO - Requesting page (first: False, try: 2): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2400&max_results=100 | |
2025-06-05 15:17:05,637 - INFO - Sleeping: 2.997478 seconds | |
2025-06-05 15:17:08,637 - INFO - Requesting page (first: False, try: 3): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2400&max_results=100 | |
2025-06-05 15:17:09,620 - INFO - Sleeping: 2.771679 seconds | |
2025-06-05 15:17:12,406 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2500&max_results=100 | |
2025-06-05 15:17:12,658 - INFO - Sleeping: 2.999001 seconds | |
2025-06-05 15:17:15,660 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2500&max_results=100 | |
2025-06-05 15:17:16,628 - INFO - Sleeping: 2.849411 seconds | |
2025-06-05 15:17:19,488 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2600&max_results=100 | |
2025-06-05 15:17:19,672 - INFO - Sleeping: 2.998053 seconds | |
2025-06-05 15:17:22,674 - INFO - Requesting page (first: False, try: 1): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2600&max_results=100 | |
2025-06-05 15:17:23,772 - INFO - Sleeping: 2.850707 seconds | |
2025-06-05 15:17:26,623 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2700&max_results=100 | |
2025-06-05 15:17:27,872 - INFO - Sleeping: 2.840207 seconds | |
2025-06-05 15:17:30,723 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2800&max_results=100 | |
2025-06-05 15:17:31,896 - INFO - Sleeping: 2.842472 seconds | |
2025-06-05 15:17:34,744 - INFO - Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=cat%3Acond-mat.mtrl-sci+OR+cat%3Amaterials%2A&id_list=&sortBy=submittedDate&sortOrder=descending&start=2900&max_results=100 | |
2025-06-05 15:17:36,203 - INFO - Saved checkpoint to scientific_corpus_data\arxiv_papers.jsonl | |
2025-06-05 15:17:36,203 - INFO - Collected 5989 arXiv papers in 298.29s | |
2025-06-05 15:17:36,457 - INFO - Saved checkpoint to scientific_corpus_data\arxiv_papers.jsonl | |
2025-06-05 15:17:36,457 - INFO - Added 5989 papers from arXiv | |
2025-06-05 15:17:36,457 - INFO - Fetching PubMed papers... | |
2025-06-05 15:17:36,457 - INFO - Starting PubMed paper collection... | |
2025-06-05 15:17:37,765 - ERROR - Network error fetching PubMed batch: HTTP Error 500: Internal Server Error | |
2025-06-05 15:17:37,766 - WARNING - Error in _fetch_pubmed_batch: HTTP Error 500: Internal Server Error. Retrying in 2.0s... | |
2025-06-05 15:18:03,278 - INFO - Saved checkpoint to scientific_corpus_data\pubmed_papers.jsonl | |
2025-06-05 15:18:03,278 - INFO - Collected 2667 PubMed papers in 26.82s | |
2025-06-05 15:18:03,279 - INFO - Processing 1000 biology papers... | |
2025-06-05 15:18:03,467 - INFO - Processed 946/1000 biology papers | |
2025-06-05 15:18:03,467 - INFO - Unknown domains: 0, Unknown sections: 144 | |
2025-06-05 15:18:03,468 - INFO - Processing 1000 biology papers... | |
2025-06-05 15:18:03,670 - INFO - Processed 991/1000 biology papers | |
2025-06-05 15:18:03,671 - INFO - Unknown domains: 0, Unknown sections: 459 | |
2025-06-05 15:18:03,671 - INFO - Processing 667 biology papers... | |
2025-06-05 15:18:03,842 - INFO - Processed 663/667 biology papers | |
2025-06-05 15:18:03,842 - INFO - Unknown domains: 0, Unknown sections: 413 | |
2025-06-05 15:18:03,942 - INFO - Saved checkpoint to scientific_corpus_data\pubmed_papers.jsonl | |
2025-06-05 15:18:03,942 - INFO - Added 2600 papers from PubMed | |
2025-06-05 15:18:03,943 - INFO - Fetching FineWeb-Edu papers... | |
2025-06-05 15:18:03,943 - INFO - Starting FineWeb-Edu collection... | |
2025-06-05 15:18:49,357 - INFO - Collected 10000 FineWeb samples | |
2025-06-05 15:18:58,079 - INFO - Collected 20000 FineWeb samples | |
2025-06-05 15:19:04,612 - INFO - Collected 30000 FineWeb samples | |
2025-06-05 15:19:04,614 - INFO - Processing 30000 FineWeb samples | |
2025-06-05 15:19:26,964 - INFO - Saved checkpoint to scientific_corpus_data\fineweb_edu.jsonl | |
2025-06-05 15:19:26,965 - INFO - Collected 29616 FineWeb-Edu papers in 83.02s | |
2025-06-05 15:19:26,983 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:28,269 - INFO - Processed 5354/1000 education papers | |
2025-06-05 15:19:28,269 - INFO - Unknown domains: 1000, Unknown sections: 696 | |
2025-06-05 15:19:28,269 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:29,616 - INFO - Processed 5622/1000 education papers | |
2025-06-05 15:19:29,616 - INFO - Unknown domains: 1000, Unknown sections: 729 | |
2025-06-05 15:19:29,617 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:31,345 - INFO - Processed 4975/1000 education papers | |
2025-06-05 15:19:31,345 - INFO - Unknown domains: 1000, Unknown sections: 733 | |
2025-06-05 15:19:31,347 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:32,824 - INFO - Processed 5011/1000 education papers | |
2025-06-05 15:19:32,824 - INFO - Unknown domains: 1000, Unknown sections: 754 | |
2025-06-05 15:19:32,825 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:34,098 - INFO - Processed 5349/1000 education papers | |
2025-06-05 15:19:34,098 - INFO - Unknown domains: 1000, Unknown sections: 733 | |
2025-06-05 15:19:34,098 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:35,465 - INFO - Processed 5667/1000 education papers | |
2025-06-05 15:19:35,465 - INFO - Unknown domains: 1000, Unknown sections: 722 | |
2025-06-05 15:19:35,465 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:36,732 - INFO - Processed 5081/1000 education papers | |
2025-06-05 15:19:36,732 - INFO - Unknown domains: 1000, Unknown sections: 720 | |
2025-06-05 15:19:36,733 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:37,777 - INFO - Processed 4592/1000 education papers | |
2025-06-05 15:19:37,777 - INFO - Unknown domains: 1000, Unknown sections: 737 | |
2025-06-05 15:19:37,778 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:39,062 - INFO - Processed 5222/1000 education papers | |
2025-06-05 15:19:39,063 - INFO - Unknown domains: 1000, Unknown sections: 737 | |
2025-06-05 15:19:39,063 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:40,299 - INFO - Processed 5117/1000 education papers | |
2025-06-05 15:19:40,299 - INFO - Unknown domains: 1000, Unknown sections: 729 | |
2025-06-05 15:19:40,300 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:41,587 - INFO - Processed 5179/1000 education papers | |
2025-06-05 15:19:41,588 - INFO - Unknown domains: 1000, Unknown sections: 700 | |
2025-06-05 15:19:41,588 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:42,756 - INFO - Processed 4847/1000 education papers | |
2025-06-05 15:19:42,756 - INFO - Unknown domains: 1000, Unknown sections: 736 | |
2025-06-05 15:19:42,756 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:44,015 - INFO - Processed 5101/1000 education papers | |
2025-06-05 15:19:44,015 - INFO - Unknown domains: 1000, Unknown sections: 709 | |
2025-06-05 15:19:44,016 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:45,446 - INFO - Processed 5065/1000 education papers | |
2025-06-05 15:19:45,446 - INFO - Unknown domains: 1000, Unknown sections: 705 | |
2025-06-05 15:19:45,446 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:46,816 - INFO - Processed 5452/1000 education papers | |
2025-06-05 15:19:46,816 - INFO - Unknown domains: 1000, Unknown sections: 714 | |
2025-06-05 15:19:46,817 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:48,103 - INFO - Processed 5244/1000 education papers | |
2025-06-05 15:19:48,104 - INFO - Unknown domains: 1000, Unknown sections: 712 | |
2025-06-05 15:19:48,104 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:49,656 - INFO - Processed 5388/1000 education papers | |
2025-06-05 15:19:49,656 - INFO - Unknown domains: 1000, Unknown sections: 719 | |
2025-06-05 15:19:49,657 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:50,870 - INFO - Processed 4964/1000 education papers | |
2025-06-05 15:19:50,870 - INFO - Unknown domains: 1000, Unknown sections: 722 | |
2025-06-05 15:19:50,870 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:52,243 - INFO - Processed 5677/1000 education papers | |
2025-06-05 15:19:52,243 - INFO - Unknown domains: 1000, Unknown sections: 692 | |
2025-06-05 15:19:52,243 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:53,816 - INFO - Processed 6299/1000 education papers | |
2025-06-05 15:19:53,816 - INFO - Unknown domains: 1000, Unknown sections: 685 | |
2025-06-05 15:19:53,818 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:55,379 - INFO - Processed 6131/1000 education papers | |
2025-06-05 15:19:55,379 - INFO - Unknown domains: 1000, Unknown sections: 701 | |
2025-06-05 15:19:55,380 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:57,140 - INFO - Processed 6091/1000 education papers | |
2025-06-05 15:19:57,140 - INFO - Unknown domains: 1000, Unknown sections: 661 | |
2025-06-05 15:19:57,141 - INFO - Processing 1000 education papers... | |
2025-06-05 15:19:58,712 - INFO - Processed 6183/1000 education papers | |
2025-06-05 15:19:58,712 - INFO - Unknown domains: 1000, Unknown sections: 684 | |
2025-06-05 15:19:58,713 - INFO - Processing 1000 education papers... | |
2025-06-05 15:20:00,147 - INFO - Processed 5664/1000 education papers | |
2025-06-05 15:20:00,147 - INFO - Unknown domains: 1000, Unknown sections: 697 | |
2025-06-05 15:20:00,147 - INFO - Processing 1000 education papers... | |
2025-06-05 15:20:01,540 - INFO - Processed 5700/1000 education papers | |
2025-06-05 15:20:01,540 - INFO - Unknown domains: 1000, Unknown sections: 729 | |
2025-06-05 15:20:01,541 - INFO - Processing 1000 education papers... | |
2025-06-05 15:20:02,755 - INFO - Processed 4994/1000 education papers | |
2025-06-05 15:20:02,755 - INFO - Unknown domains: 1000, Unknown sections: 735 | |
2025-06-05 15:20:02,756 - INFO - Processing 1000 education papers... | |
2025-06-05 15:20:04,147 - INFO - Processed 5757/1000 education papers | |
2025-06-05 15:20:04,148 - INFO - Unknown domains: 1000, Unknown sections: 682 | |
2025-06-05 15:20:04,148 - INFO - Processing 1000 education papers... | |
2025-06-05 15:20:05,516 - INFO - Processed 5505/1000 education papers | |
2025-06-05 15:20:05,516 - INFO - Unknown domains: 1000, Unknown sections: 725 | |
2025-06-05 15:20:05,516 - INFO - Processing 1000 education papers... | |
2025-06-05 15:20:06,939 - INFO - Processed 5139/1000 education papers | |
2025-06-05 15:20:06,939 - INFO - Unknown domains: 1000, Unknown sections: 716 | |
2025-06-05 15:20:06,939 - INFO - Processing 616 education papers... | |
2025-06-05 15:20:07,675 - INFO - Processed 3032/616 education papers | |
2025-06-05 15:20:07,675 - INFO - Unknown domains: 616, Unknown sections: 447 | |
2025-06-05 15:20:57,703 - INFO - Saved checkpoint to scientific_corpus_data\fineweb-edu_papers.jsonl | |
2025-06-05 15:20:57,705 - INFO - Added 159402 papers from FineWeb-Edu | |
2025-06-05 15:20:57,705 - INFO - Total papers collected: 167991 | |
2025-06-05 15:20:57,705 - INFO - Ranking and deduplicating papers... | |
2025-06-05 15:25:39,330 - INFO - Final corpus size: 167963 papers | |
2025-06-05 15:26:32,891 - INFO - Saved checkpoint to scientific_corpus_data\ranked_papers.jsonl | |
2025-06-05 16:39:00,342 - INFO - Processing final dataset in batches... | |
2025-06-05 16:39:56,075 - WARNING - scientific_corpus_325M.jsonl is larger than 10 MiB. HuggingFace will reject files >10 MiB unless you use Git LFS. See https://hf.co/docs/hub/repositories-getting-started#terminal | |
2025-06-05 16:39:56,076 - WARNING - To fix: install git-lfs and run 'git lfs track "*.jsonl"' before pushing, or split your file. | |
2025-06-05 16:39:56,080 - INFO - Scientific corpus successfully built: scientific_corpus_325M.jsonl | |
2025-06-05 16:40:00,188 - WARNING - C:\Users\kunya\PycharmProjects\DataVolt\Tokenization\./Scientific_Research_Tokenized is already a clone of https://huggingface.co/datasets/Allanatrix/Scientific_Research_Tokenized. Make sure you pull the latest changes with `repo.git_pull()`. | |
2025-06-05 16:46:36,513 - WARNING - Several commits (2) will be pushed upstream. | |
2025-06-05 16:46:36,531 - WARNING - The progress bars may be unreliable. | |
2025-06-05 19:44:05,295 - WARNING - EOF | |
error: failed to push some refs to 'https://huggingface.co/datasets/Allanatrix/Scientific_Research_Tokenized' | |
2025-06-05 19:44:06,016 - ERROR - Error during Hugging Face upload: EOF | |
error: failed to push some refs to 'https://huggingface.co/datasets/Allanatrix/Scientific_Research_Tokenized' | |