add scratchpad
scratchpad/filtering-arxiv-extracts-category-or-keyword.md
ADDED
@@ -0,0 +1,73 @@
Based on the code and arXiv summaries provided, several changes would help filter extracts toward LLM (Large Language Model) and VLM (Vision-Language Model) papers, with a particular emphasis on agentic systems:

1. Refine the search query:
Instead of using a broad query, use more specific keywords related to LLMs, VLMs, and agentic systems. For example:
- "large language model" OR "vision language model" OR "multimodal AI"
- "agentic systems" OR "multi-agent" OR "autonomous agents"
- "reinforcement learning" OR "self-supervised learning" in combination with the above terms
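
The keyword groups above can be assembled into a single arXiv API search string. The following is a minimal sketch: the helper `or_group` is a hypothetical name, and it assumes the retrieval service forwards the string to the arXiv API unchanged. It relies only on documented arXiv query syntax (the `all:` field prefix, double-quoted phrases, parentheses, and the `AND`/`OR` operators).

```python
def or_group(phrases, field="all"):
    """Join phrases into a parenthesized OR group, quoting each phrase.

    `field` is an arXiv field prefix such as "all", "ti" (title) or "abs" (abstract).
    """
    return "(" + " OR ".join(f'{field}:"{phrase}"' for phrase in phrases) + ")"


# Keyword groups taken from the list above.
model_terms = ["large language model", "vision language model", "multimodal AI"]
agent_terms = ["agentic systems", "multi-agent", "autonomous agents"]

# A paper must match at least one term from each group.
search_query = f"{or_group(model_terms)} AND {or_group(agent_terms)}"
print(search_query)
```

Quoting each phrase keeps multi-word terms together, which an unquoted query would not guarantee.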

2. Add category filters:
The arXiv API supports filtering by category through the `cat:` field prefix. Focus on relevant categories such as:
- cs.AI (Artificial Intelligence)
- cs.CL (Computation and Language)
- cs.CV (Computer Vision and Pattern Recognition)
- cs.LG (Machine Learning)
- cs.MA (Multiagent Systems)
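
As a rough illustration of the category filter, the sketch below appends a `cat:` clause to a keyword query and builds the request URL for the public arXiv API endpoint. The function names `with_categories` and `build_url` are hypothetical and not part of the existing code; no network request is made.

```python
from urllib.parse import urlencode

CATEGORIES = ["cs.AI", "cs.CL", "cs.CV", "cs.LG", "cs.MA"]


def with_categories(keyword_query, categories=CATEGORIES):
    """Restrict a keyword query to papers listed in any of the given categories."""
    cat_clause = " OR ".join(f"cat:{category}" for category in categories)
    return f"({keyword_query}) AND ({cat_clause})"


def build_url(search_query, max_results=100):
    """Build an arXiv API request URL; urlencode handles quotes, spaces and parentheses."""
    params = {"search_query": search_query, "start": 0, "max_results": max_results}
    return "http://export.arxiv.org/api/query?" + urlencode(params)


query = with_categories('all:"large language model"', ["cs.CL", "cs.MA"])
url = build_url(query, max_results=5)
print(query)
print(url)
```

Filtering by category on the server side reduces how many irrelevant records have to be downloaded and filtered locally.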

3. Implement keyword-based filtering:
After fetching the results, add a filtering step that checks the title, abstract, and categories of each paper against a list of relevant terms, and keeps only the papers whose metadata contains at least one of them.

4. Use natural language processing techniques:
Employ more advanced NLP techniques to analyze the abstracts and determine relevance. This could include:
- Topic modeling to identify papers discussing LLMs, VLMs, or agentic systems
- Semantic similarity comparison with a reference text describing the topics of interest
- Named entity recognition to identify mentions of specific models or techniques
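
The similarity comparison against a reference text could look roughly like the sketch below. It uses a plain bag-of-words cosine similarity from the standard library as a stand-in; a real system would more likely use sentence embeddings, which are not shown here. The reference text and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical reference description of the topics of interest.
REFERENCE = ("large language models and vision language models "
             "used as autonomous agents in multi-agent systems")
REFERENCE_VECTOR = bag_of_words(REFERENCE)


def similarity_to_reference(abstract):
    """Score an abstract by its lexical overlap with the reference text."""
    return cosine_similarity(bag_of_words(abstract), REFERENCE_VECTOR)


on_topic = similarity_to_reference("We study language models acting as autonomous agents.")
off_topic = similarity_to_reference("A new bound for prime gaps in number theory.")
print(on_topic > off_topic)
```

Papers could then be kept only when their score exceeds a threshold chosen by inspecting a sample of known relevant abstracts.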

5. Implement a scoring system:
Assign each paper a score for its relevance to LLMs, VLMs, and agentic systems, based on keyword matches, category relevance, and other factors. Then sort the results by this score and return only the top N papers.
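
A minimal sketch of such a scoring step follows. The weights are arbitrary placeholders, and the paper dictionaries are assumed to carry `title`, `summary`, and a list-valued `categories` field; the actual shape returned by the retrieval service may differ.

```python
import heapq

# Illustrative weights only; tune them against papers of known relevance.
KEYWORD_WEIGHTS = {
    "large language model": 3.0,
    "vision language model": 3.0,
    "multi-agent": 2.0,
    "autonomous agent": 2.0,
    "agentic": 2.0,
}
CATEGORY_WEIGHTS = {"cs.MA": 2.0, "cs.CL": 1.5, "cs.AI": 1.0, "cs.CV": 1.0, "cs.LG": 0.5}


def score_paper(paper):
    """Sum the weights of matched keywords and of the paper's categories."""
    text = f"{paper['title']} {paper['summary']}".lower()
    score = sum(weight for keyword, weight in KEYWORD_WEIGHTS.items() if keyword in text)
    score += sum(CATEGORY_WEIGHTS.get(category, 0.0) for category in paper["categories"])
    return score


def top_n(papers, n=10):
    """Return the n highest-scoring papers, best first."""
    return heapq.nlargest(n, papers, key=score_paper)


papers = [
    {"title": "Prime gaps", "summary": "Number theory.", "categories": ["math.NT"]},
    {"title": "Multi-agent LLM planning",
     "summary": "A large language model coordinates autonomous agents.",
     "categories": ["cs.MA", "cs.AI"]},
]
best = top_n(papers, n=1)
print(best[0]["title"])
```

Unlike a yes/no filter, a score lets borderline papers be ranked rather than dropped, and the cutoff N can be adjusted without changing the matching rules.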

6. Prioritize recent papers:
Because LLMs and VLMs are rapidly evolving fields, add a date filter or give higher weight to newer publications.
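
One way to weight newer publications is an exponential decay on paper age, sketched below. The 180-day half-life is an arbitrary assumption, and the timestamp format is assumed to be the UTC form used in arXiv Atom feeds (for example `2024-01-03T00:00:00Z`); both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_published(timestamp):
    """Parse a UTC timestamp such as '2024-01-03T00:00:00Z' into an aware datetime."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def recency_weight(published, now, half_life_days=180.0):
    """Return 1.0 for a brand-new paper, halving every `half_life_days` of age."""
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


now = datetime(2024, 7, 1, tzinfo=timezone.utc)
recent = recency_weight(parse_published("2024-06-24T00:00:00Z"), now)
older = recency_weight(now - timedelta(days=360), now)
print(recent > older)
```

Multiplying a relevance score by this weight keeps older but highly relevant papers in play while still favoring recent work.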

7. Expand the metadata retrieval:
If possible, retrieve additional metadata such as full abstracts, author information, and citations, and use it for more accurate filtering and relevance determination. Citation counts are not part of the arXiv API response, so they would have to come from an external source.

8. Implement a feedback loop:
Allow users to mark papers as relevant or irrelevant, and use this feedback to improve the filtering algorithm over time.
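
As one very simple form of such a feedback loop, the sketch below nudges keyword weights up or down whenever a user labels a paper. This is only an illustration under stated assumptions: the function `update_weights`, the step size, and the idea of storing per-keyword weights are all hypothetical and not part of the existing code.

```python
def update_weights(weights, paper_text, relevant, step=0.1):
    """Return new keyword weights after one piece of user feedback.

    Keywords found in the paper text gain `step` if the paper was marked
    relevant and lose `step` otherwise; weights never drop below zero.
    """
    text = paper_text.lower()
    delta = step if relevant else -step
    return {
        keyword: max(weight + delta, 0.0) if keyword in text else weight
        for keyword, weight in weights.items()
    }


weights = {"multi-agent": 1.0, "benchmark": 1.0, "diffusion": 1.0}
weights = update_weights(weights, "A multi-agent LLM framework", relevant=True)
weights = update_weights(weights, "Yet another benchmark", relevant=False)
print(weights)
```

Over many labels, keywords that mostly appear in rejected papers lose influence on the ranking, while those appearing in accepted papers gain it.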

Here's a basic example of how you might implement some of these suggestions in the existing code:

```python
from arxiv_retrieval_service import ArxivRetrievalService

def is_relevant(paper):
    # Keywords are lowercase because they are matched against lowercased text
    relevant_keywords = ["large language model", "vision language model", "multimodal ai",
                         "agentic system", "multi-agent", "autonomous agent"]
    relevant_categories = ["cs.AI", "cs.CL", "cs.CV", "cs.LG", "cs.MA"]

    # Check if any relevant keyword is in the title or abstract
    if any(keyword in paper['title'].lower() or keyword in paper['summary'].lower()
           for keyword in relevant_keywords):
        return True

    # Check if any relevant category is in the paper's categories
    if any(category in paper['categories'] for category in relevant_categories):
        return True

    return False

def fetch_relevant_papers(query, max_results=100):
    arxiv_service = ArxivRetrievalService()
    all_papers = arxiv_service.fetch_metadata(query, max_results)

    # Filter papers based on relevance
    relevant_papers = [paper for paper in all_papers if is_relevant(paper)]

    return relevant_papers

# Usage
query = "(large language model OR vision language model OR multimodal AI) AND (agentic system OR multi-agent OR autonomous agent)"
relevant_papers = fetch_relevant_papers(query)
```

This example implements a basic keyword and category-based filter. Because a paper passes on either a keyword match or a category match, every paper in the listed categories is accepted; requiring both conditions would give higher precision. The filter could be further refined and expanded based on the specific needs and the desired level of precision.