Upload 8 files
- README.md +37 -12
- app.py +95 -72
- gitattributes +6 -35
- papersearch.py +28 -151
- pdfpass.py +8 -28
- pdfsum.py +30 -123
- requirements.txt +0 -0
- textsumm.py +7 -24
README.md
CHANGED
@@ -1,12 +1,37 @@
----
-title:
-emoji:
-colorFrom:
-colorTo:
-sdk: streamlit
-sdk_version: 1.
-app_file: app.py
-pinned: false
+---
+title: PDF工具箱(多功能PDF助手)
+emoji: 📄
+colorFrom: blue
+colorTo: green
+sdk: streamlit
+sdk_version: 1.35.0
+app_file: app.py
+pinned: false
+license: mit
+---
+
+# 📄 PDF Toolbox (all-in-one)
+
+This is a multi-purpose PDF processing platform with a fully Chinese-localized interface, supporting:
+
+- **Text summarization**: automatically generate key-point summaries with the OpenAI GPT-4/4.1/4.5 models
+- **PDF summarization**: summarize long PDF documents
+- **PDF password removal**: strip the password from encrypted PDFs
+- **arXiv paper search**: search and filter papers from a Chinese-language interface
+- **PDF merging**, **splitting**, **text extraction**, and more
+- A fully Chinese interface and instructions, suited to education, research, and administrative work
+
+## Usage
+
+1. Enter your OpenAI API key in the sidebar (it starts with sk- or sk-proj-)
+2. Choose the GPT model you need (gpt-4, gpt-4.1, gpt-4.5)
+3. Pick a feature tab on the left and upload files as required
+4. Every step comes with prompts in Chinese
+
+> 💡 **Note**: your API key is only used for the current session and is never stored on the server.
+
+## Contact and contributions
+
+Suggestions for improvements and new features are welcome; please open an issue on Hugging Face or GitHub.
+
+---
app.py
CHANGED
@@ -1,81 +1,104 @@
-if st.button("Summarize"):
-    summary = summarizer(user_input, max_length=130, min_length=30, do_sample=False)
-    st.subheader("Summary")
-    st.write(summary[0]["summary_text"])
 import streamlit as st
+import openai
+from textsumm import 文字摘要
+from pdfsum import 提取_pdf文字, 分段, 摘要
+from pdfpass import 移除_pdf密碼
+from papersearch import 抓取論文, 篩選論文依年份
 from io import BytesIO
 from datetime import datetime
 from pypdf import PdfReader, PdfWriter

+# ---- Must run before any other st.* call ----
+st.set_page_config(page_title="PDF 工具箱", page_icon="📄", layout="wide")

+# ---- Sidebar (API key and model selection) ----
+st.sidebar.title("📄 PDF 工具箱")
+api_key = st.sidebar.text_input("請輸入 OpenAI API 金鑰", type="password", placeholder="sk-...")
+selected_model = st.sidebar.radio("選擇 GPT 模型", ["gpt-4", "gpt-4.0", "gpt-4.1", "gpt-4.5"], index=0)

+if api_key:
+    openai.api_key = api_key
+else:
+    st.sidebar.warning("請輸入你的 OpenAI API Key(sk- 或 sk-proj- 開頭)")

+# ---- Feature pages ----
+page = st.sidebar.radio(
+    "選擇功能",
+    [
+        "文字摘要",
+        "PDF 摘要",
+        "PDF 密碼移除",
+        "論文搜尋",
+        "PDF 合併",
+        "PDF 拆頁",
+        "PDF 轉純文字"
+    ]
+)

+# Text summarization
+if page == "文字摘要":
+    st.title("📝 文字摘要")
+    user_input = st.text_area("請輸入要摘要的文字")
+    if st.button("生成摘要"):
+        if not api_key:
+            st.error("請先輸入 OpenAI API 金鑰!")
+        else:
+            結果 = 文字摘要(user_input)
+            st.subheader("摘要結果")
+            st.write(結果[0]["summary_text"])
+
+# PDF summarization
+elif page == "PDF 摘要":
+    st.title("📜 PDF 摘要")
+    uploaded_file = st.file_uploader("上傳你的 PDF 檔案", type=["pdf"])
+    if uploaded_file is not None and st.button("產生 PDF 摘要"):
+        pdf_text = 提取_pdf文字(uploaded_file)
+        段落們 = 分段(pdf_text)
+        全部摘要 = " ".join(摘要(段落們))
+        st.subheader("摘要結果")
+        st.write(全部摘要)
+
+# PDF password removal
+elif page == "PDF 密碼移除":
+    st.title("🔑 PDF 密碼移除")
+    uploaded_file = st.file_uploader("選擇需要解鎖的 PDF 檔案", type=["pdf"])
+    password = st.text_input("請輸入 PDF 密碼", type="password")
+    if uploaded_file and password and st.button("移除密碼"):
+        output = 移除_pdf密碼(uploaded_file, password)
         if isinstance(output, BytesIO):
+            st.success("密碼移除成功!")
+            st.download_button("下載已解鎖的 PDF", data=output, file_name="unlocked_pdf.pdf", mime="application/pdf")
         else:
+            st.error(f"錯誤:{output}")

+# arXiv paper search
+elif page == "論文搜尋":
+    st.title("🔍 論文搜尋(arXiv)")
+    query = st.text_input("輸入主題或關鍵字", placeholder="例如:人工智慧、量子計算")
+    max_results = st.slider("結果數量", 1, 50, 10)
     col1, col2 = st.columns(2)
     with col1:
+        start_year = st.number_input("起始年份", min_value=1900, max_value=datetime.now().year, value=2000)
     with col2:
+        end_year = st.number_input("結束年份", min_value=1900, max_value=datetime.now().year, value=datetime.now().year)
+    if st.button("搜尋論文"):
+        papers = 抓取論文(query, max_results)
+        篩選後 = 篩選論文依年份(papers, start_year, end_year)
+        if 篩選後:
+            for idx, 論文 in enumerate(篩選後, start=1):
+                st.write(f"### {idx}. {論文['標題']}")
+                st.write(f"**作者**: {', '.join(論文['作者'])}")
+                st.write(f"**發表時間**: {論文['發表時間']}")
+                st.write(f"[閱讀全文]({論文['連結']})")
                 st.write("---")
         else:
+            st.warning("在所選年份範圍內沒有找到相關論文。")

+# PDF merge
+elif page == "PDF 合併":
+    st.title("📎 多檔 PDF 合併")
+    uploaded_files = st.file_uploader("上傳多個 PDF 檔案", type=["pdf"], accept_multiple_files=True)
+    if uploaded_files and st.button("合併 PDF"):
         pdf_writer = PdfWriter()
         for file in uploaded_files:
             pdf_reader = PdfReader(file)
@@ -84,12 +107,12 @@ elif page == "PDF Merger":
         output = BytesIO()
         pdf_writer.write(output)
         output.seek(0)
+        st.download_button("下載合併後的 PDF", data=output, file_name="merged.pdf", mime="application/pdf")

+# PDF split
+elif page == "PDF 拆頁":
+    st.title("✂️ PDF 拆頁")
+    uploaded_file = st.file_uploader("上傳一個 PDF", type=["pdf"])
     if uploaded_file:
         pdf_reader = PdfReader(uploaded_file)
         for i, page in enumerate(pdf_reader.pages):
@@ -98,12 +121,12 @@ elif page == "PDF Splitter":
             output = BytesIO()
             pdf_writer.write(output)
             output.seek(0)
+            st.download_button(f"下載第 {i+1} 頁", data=output, file_name=f"page_{i+1}.pdf", mime="application/pdf")

+# PDF to plain text
+elif page == "PDF 轉純文字":
+    st.title("📜 PDF 轉純文字")
+    uploaded_file = st.file_uploader("上傳 PDF", type=["pdf"])
     if uploaded_file:
+        pdf_text = 提取_pdf文字(uploaded_file)
+        st.text_area("擷取內容", pdf_text, height=300)
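The per-page copy in the merge branch falls inside an elided hunk; below is a minimal sketch (not part of the commit) of the same merge flow with pypdf, assuming pages are appended with `PdfWriter.add_page` the way pdfpass.py does.

```python
# Hypothetical standalone sketch of the "PDF 合併" flow, outside Streamlit.
# Assumes the elided hunk appends pages with PdfWriter.add_page, as pdfpass.py does.
from io import BytesIO
from pypdf import PdfReader, PdfWriter

def merge_pdfs(paths):
    writer = PdfWriter()
    for path in paths:
        reader = PdfReader(path)
        for page in reader.pages:
            writer.add_page(page)   # copy every page into the combined document
    output = BytesIO()
    writer.write(output)            # same BytesIO + write + seek pattern as app.py
    output.seek(0)
    return output

# merged = merge_pdfs(["a.pdf", "b.pdf"])   # file names are illustrative
# open("merged.pdf", "wb").write(merged.getvalue())
```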
gitattributes
CHANGED
@@ -1,35 +1,6 @@
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text
+# Git LFS attributes file (used to control large files)
+*.pdf filter=lfs diff=lfs merge=lfs -text
+*.jpg filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
+
+# Note: the rules above route PDFs and images through Git LFS (friendlier handling of large files)
papersearch.py
CHANGED
@@ -1,154 +1,31 @@
-import streamlit as st
-
-def filter_papers_by_year(papers, start_year, end_year):
-    """Filter papers by the publication year range."""
-    filtered_papers = []
-    for paper in papers:
-        try:
-            published_year = int(paper['published'][:4])  # Extract year from the published date
-            if start_year <= published_year <= end_year:
-                filtered_papers.append(paper)
-        except ValueError:
-            continue  # Skip if the year is not valid
-    return filtered_papers
-
-# Streamlit app UI
-st.title("arXiv Research Paper Search")
-st.subheader("Find academic papers on your topic of interest")
-
-# Input fields
-query = st.text_input("Enter a topic or keywords", placeholder="e.g., machine learning, quantum computing")
-max_results = st.slider("Number of results", min_value=1, max_value=50, value=10)
-
-# Year filter
-col1, col2 = st.columns(2)
-with col1:
-    start_year = st.number_input("Start Year", min_value=1900, max_value=datetime.now().year, value=2000, step=1)
-with col2:
-    end_year = st.number_input("End Year", min_value=1900, max_value=datetime.now().year, value=datetime.now().year, step=1)
-
-if st.button("Search"):
-    if query.strip():
-        st.info(f"Searching for papers on: **{query}**")
-        papers = fetch_papers(query, max_results)
-
-        # Filter papers by year
-        papers_filtered = filter_papers_by_year(papers, start_year, end_year)
-
-        if papers_filtered:
-            st.success(f"Found {len(papers_filtered)} papers between {start_year} and {end_year}!")
-            for idx, paper in enumerate(papers_filtered, start=1):
-                st.write(f"### {idx}. {paper['title']}")
-                st.write(f"**Authors**: {', '.join(paper['authors'])}")
-                st.write(f"**Published**: {paper['published']}")
-                st.write(f"[Read More]({paper['link']})")
-                st.write("---")
-        else:
-            st.warning(f"No papers found between {start_year} and {end_year}. Try a different query or adjust the year range.")
-    else:
-        st.error("Please enter a topic or keywords to search.")
 import requests
+import xml.etree.ElementTree as ET
 from datetime import datetime

+def 抓取論文(關鍵字, 最大數量=10):
+    """
+    Search arXiv for papers matching the keyword (most recently updated first).
+    """
+    url = f"https://export.arxiv.org/api/query?search_query=all:{關鍵字}&start=0&max_results={最大數量}&sortBy=lastUpdatedDate"
+    res = requests.get(url)
+    root = ET.fromstring(res.content)
+    論文清單 = []
+    for entry in root.findall('{http://www.w3.org/2005/Atom}entry'):
+        論文清單.append({
+            "標題": entry.find('{http://www.w3.org/2005/Atom}title').text.strip(),
+            "作者": [author.find('{http://www.w3.org/2005/Atom}name').text for author in entry.findall('{http://www.w3.org/2005/Atom}author')],
+            "發表時間": entry.find('{http://www.w3.org/2005/Atom}published').text[:10],
+            "連結": entry.find('{http://www.w3.org/2005/Atom}id').text
+        })
+    return 論文清單
+
+def 篩選論文依年份(論文清單, 起始, 結束):
+    """
+    Filter papers by publication year (inclusive range).
+    """
+    篩選 = []
+    for 論文 in 論文清單:
+        年份 = int(論文["發表時間"][:4])
+        if 起始 <= 年份 <= 結束:
+            篩選.append(論文)
+    return 篩選
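For reference, a minimal sketch (not part of the commit) of driving the two helpers above outside Streamlit; the query and year range are illustrative, and the dictionary keys (標題, 作者, 發表時間, 連結) are the ones 抓取論文 builds.

```python
# Hypothetical standalone usage of the arXiv helpers above.
from papersearch import 抓取論文, 篩選論文依年份

papers = 抓取論文("transformers", 最大數量=5)      # illustrative query
recent = 篩選論文依年份(papers, 2022, 2025)          # illustrative year range
for 論文 in recent:
    # Keys come from 抓取論文: 標題, 作者, 發表時間, 連結
    print(論文["發表時間"], 論文["標題"], 論文["連結"])
```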
pdfpass.py
CHANGED
@@ -1,40 +1,20 @@
-from PyPDF2 import PdfReader, PdfWriter
-
-st.title("PDF Password Remover")
-st.write("Upload a password-protected PDF and remove its password.")
-
-# File upload
-uploaded_file = st.file_uploader("Choose a PDF file", type=["pdf"])
-password = st.text_input("Enter the PDF password", type="password")
-
-if uploaded_file and password:
-    if st.button("Remove Password"):
-        output = remove_pdf_password(uploaded_file, password)
-        if isinstance(output, BytesIO):
-            st.success("Password removed successfully!")
-            st.download_button(
-                label="Download PDF without Password",
-                data=output,
-                file_name="unlocked_pdf.pdf",
-                mime="application/pdf",
-            )
-        else:
-            st.error(f"Error: {output}")
+from pypdf import PdfReader, PdfWriter
 from io import BytesIO

+def 移除_pdf密碼(pdf檔案, 密碼):
+    """
+    Unlock a password-protected PDF; return the unlocked file (BytesIO) or an error message.
+    """
     try:
+        reader = PdfReader(pdf檔案)
         if reader.is_encrypted:
+            reader.decrypt(密碼)
         writer = PdfWriter()
         for page in reader.pages:
             writer.add_page(page)
         output = BytesIO()
         writer.write(output)
         output.seek(0)
         return output
     except Exception as e:
+        return f"解鎖失敗:{e}"
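A minimal sketch (not part of the commit) of calling 移除_pdf密碼 outside Streamlit; the file names and password are illustrative, and the success/error branching follows the `isinstance(output, BytesIO)` check used in app.py.

```python
# Hypothetical command-line usage of the password-removal helper above.
from io import BytesIO
from pdfpass import 移除_pdf密碼

with open("locked.pdf", "rb") as f:          # illustrative input file
    result = 移除_pdf密碼(f, "my-password")   # illustrative password

if isinstance(result, BytesIO):              # success: an unlocked in-memory PDF
    with open("unlocked.pdf", "wb") as out:
        out.write(result.getvalue())
else:                                        # failure: the error string returned by the helper
    print(result)
```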
pdfsum.py
CHANGED
@@ -1,125 +1,32 @@
-import streamlit as st
 from transformers import pipeline
-import pdfplumber
-
-# Initialize the summarizer
-summarizer = pipeline("summarization", model="t5-small")
-
-def extract_text_from_pdf(pdf_file):
-    """Extract text from an uploaded PDF file using pdfplumber."""
-    try:
-        text = ""
-        with pdfplumber.open(pdf_file) as pdf:
-            for page in pdf.pages:
-                text += page.extract_text() + "\n"
-        if not text.strip():
-            raise ValueError("No extractable text found in the PDF.")
-        return text
-    except Exception as e:
-        raise ValueError(f"Error extracting text from PDF: {e}")
-
-def split_text_into_chunks(text, max_chunk_size=1024):
-    """Split the text into smaller chunks for summarization."""
-    chunks = []
-    while len(text) > max_chunk_size:
-        split_point = text.rfind(". ", 0, max_chunk_size) + 1  # Find the last full sentence
-        if split_point == 0:  # No sentence boundary found, split arbitrarily
-            split_point = max_chunk_size
-        chunks.append(text[:split_point])
-        text = text[split_point:]
-    if text:
-        chunks.append(text)
-    return chunks
-
-def summarize_text(chunks):
-    """Summarize each chunk of text with dynamic max_length."""
-    summaries = []
-    for chunk in chunks:
-        input_length = len(chunk.split())  # Approximate token count
-        max_length = max(48, int(input_length * 0.8))  # Set max_length to 80% of input length
-        summary = summarizer(chunk, max_length=max_length, min_length=10, do_sample=False)
-        summaries.append(summary[0]["summary_text"])
-    return summaries
-
-# Streamlit Dashboard
-st.title("PDF Summarizer")
-st.write("Upload a PDF file to get a summarized version of its content.")
-
-uploaded_file = st.file_uploader("Upload your PDF", type=["pdf"])
+from PyPDF2 import PdfReader
 from transformers import pipeline

+# You can swap in the Chinese BART, T5, or other summarization model you prefer here
+summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
+
+def 提取_pdf文字(pdf檔案):
+    """
+    Read a PDF and concatenate the text of every page into one plain-text string.
+    """
+    reader = PdfReader(pdf檔案)
+    內容 = ""
+    for 頁面 in reader.pages:
+        內容 += 頁面.extract_text()
+    return 內容
+
+def 分段(內容, 每段字數=2000):
+    """
+    Split long text into chunks (easier for the model to process).
+    """
+    return [內容[i:i+每段字數] for i in range(0, len(內容), 每段字數)]
+
+def 摘要(段落們):
+    """
+    Produce a Chinese summary for each chunk, then merge them back into one overall digest.
+    """
+    結果 = []
+    for 段 in 段落們:
+        結果.append(
+            summarizer(段, max_length=130, min_length=30, do_sample=False)[0]["summary_text"]
+        )
+    return 結果
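A minimal sketch (not part of the commit) of how the three helpers above chain together outside Streamlit, mirroring the "PDF 摘要" page in app.py; the input file name is illustrative.

```python
# Hypothetical end-to-end run of 提取_pdf文字 -> 分段 -> 摘要.
from pdfsum import 提取_pdf文字, 分段, 摘要

with open("report.pdf", "rb") as f:      # illustrative input file
    text = 提取_pdf文字(f)

chunks = 分段(text, 每段字數=2000)        # same default chunk size as the module
summary = " ".join(摘要(chunks))          # app.py joins the per-chunk summaries the same way
print(summary)
```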
requirements.txt
CHANGED
Binary files a/requirements.txt and b/requirements.txt differ
textsumm.py
CHANGED
@@ -1,28 +1,11 @@
-ARTICLE ="""
-There is widespread international concern that Russia's war will provoke a global food crisis similar to, or
-worse than, that faced in 2007 and 2008. The war comes at a time when the global food system was already
-struggling to feed its growing population in a sustainable way, under the pressure caused by climate change
-and the Covid-19 pandemic. Russia and Ukraine are key agricultural players, together exporting nearly 12
-% of food calories traded globally. They are major providers of basic agro-commodities, including wheat,
-maize and sunflower oil, and Russia is the world's top exporter of fertilisers. The global supply chain will
-get impacted until Russia and Ukraine retreat and will end the war.
-The war's impact on global food supply centred on three factors. First is a significant reduction in exports
-and production of essential commodities from both countries, caused by the war and not the economic
-sanctions imposed on Russia, which, intentionally, did not target the agricultural sector. Overall, the
-European Commission estimates that 'up to 25 million tonnes of wheat would need to be substituted to
-meet worldwide food needs in the current and the next season. Second factor is a global spike in prices of
-food supplies and inputs needed for agri-food production, which were already at record levels before the
-war. The war has further pushed the prices up. Third factor is the international response to the above,
-which could either amplify the effects of the crisis (mainly by uncoordinated export bans) or mitigate them
-(applying lessons learnt from the 2007-2008 food crisis). A number of countries, other than Russia and
-Ukraine, have already imposed or announced their intention to impose some control over exports of
-essential agricultural commodities, including Egypt, Argentina, Indonesia, Serbia, Turkey and, in the EU,
-Hungary. We should keep this in our mind that the long duration of war will make the global situation
-irrecoverable.
-
-"""
-print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
 from transformers import pipeline

+# Build the Chinese summarization pipeline
 summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

+def 文字摘要(輸入文本, max_length=130, min_length=30, do_sample=False):
+    """
+    Automatically condense the input plain text into Traditional Chinese key points.
+    """
+    result = summarizer(輸入文本, max_length=max_length, min_length=min_length, do_sample=do_sample)
+    return result
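A quick usage sketch (not part of the commit) of 文字摘要; the sample text is made up, and the pipeline returns a list of {"summary_text": ...} dicts, which is how app.py reads the result.

```python
# Hypothetical quick check of the text-summarization helper above.
from textsumm import 文字摘要

text = "Streamlit lets you build data apps in pure Python. " * 10   # illustrative input
result = 文字摘要(text, max_length=60, min_length=20)
print(result[0]["summary_text"])   # the pipeline returns a list of dicts
```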