Spaces: huggingface-KREW/test-github-CI-for-i18n-agent

wony617 committed 23 days ago

Commit 22556a8 (1 parent: ec613d7)

Add agent workflow

Files changed (2)

agent/toctree_handler.py +298 -0
translator/prompt_glossary.py +126 -0

agent/toctree_handler.py ADDED

	@@ -0,0 +1,298 @@

+import yaml
+import requests
+from typing import Dict, List, Any
+import os
+class TocTreeHandler:
+    def __init__(self):
+        self.en_toctree_url = "https://raw.githubusercontent.com/huggingface/transformers/main/docs/source/en/_toctree.yml"
+        self.ko_toctree_url = "https://raw.githubusercontent.com/huggingface/transformers/main/docs/source/ko/_toctree.yml"
+        self.local_docs_path = "docs/source/ko"
+    def fetch_toctree(self, url: str) -> List[Dict[str, Any]]:
+        """Fetch and parse a toctree YAML file from a URL"""
+        # A timeout keeps the workflow from hanging on a stalled connection
+        response = requests.get(url, timeout=30)
+        response.raise_for_status()
+        return yaml.safe_load(response.text)
+    def get_en_toctree(self) -> List[Dict[str, Any]]:
+        """Get English toctree structure"""
+        return self.fetch_toctree(self.en_toctree_url)
+    def get_ko_toctree(self) -> List[Dict[str, Any]]:
+        """Get Korean toctree structure"""
+        return self.fetch_toctree(self.ko_toctree_url)
+    def extract_title_mappings(self, en_data: List[Dict], ko_data: List[Dict]) -> Dict[str, str]:
+        """Extract title mappings between English and Korean"""
+        mappings = {}
+        def process_section(en_section: Dict, ko_section: Dict):
+            if 'local' in en_section and 'local' in ko_section:
+                if en_section['local'] == ko_section['local']:
+                    en_title = en_section.get('title', '')
+                    ko_title = ko_section.get('title', '')
+                    if en_title and ko_title:
+                        mappings[en_title] = ko_title
+            if 'sections' in en_section and 'sections' in ko_section:
+                en_sections = en_section['sections']
+                ko_sections = ko_section['sections']
+                for i, en_sub in enumerate(en_sections):
+                    if i < len(ko_sections):
+                        process_section(en_sub, ko_sections[i])
+        for i, en_item in enumerate(en_data):
+            if i < len(ko_data):
+                process_section(en_item, ko_data[i])
+        return mappings
+    def translate_title(self, en_title: str) -> str:
+        """Translate English title to Korean using LLM"""
+        try:
+            from translator.content import llm_translate
+            prompt = f"""Translate the following English documentation title to Korean. Return only the translated title, nothing else.
+English title: {en_title}
+Korean title:"""
+            callback_result, translated_title = llm_translate(prompt)
+            return translated_title.strip()
+        except Exception as e:
+            print(f"Error translating title '{en_title}': {e}")
+            return en_title
+    def create_local_toctree(self, en_title: str, local_file_path: str) -> Dict[str, str]:
+        """Create local toctree entry with Korean title and local path"""
+        try:
+            # First try to get Korean title from existing mappings
+            en_data = self.get_en_toctree()
+            ko_data = self.get_ko_toctree()
+            title_mappings = self.extract_title_mappings(en_data, ko_data)
+            ko_title = title_mappings.get(en_title)
+            # If no existing mapping, translate the title
+            if not ko_title:
+                ko_title = self.translate_title(en_title)
+            return {
+                'local': local_file_path,
+                'title': ko_title
+            }
+        except Exception as e:
+            print(f"Error creating local toctree: {e}")
+            return {
+                'local': local_file_path,
+                'title': en_title
+            }
+    def update_local_toctree_file(self, new_entries: List[Dict[str, str]]):
+        """Update or create local _toctree.yml file"""
+        toctree_path = os.path.join(self.local_docs_path, "_toctree.yml")
+        os.makedirs(self.local_docs_path, exist_ok=True)
+        if os.path.exists(toctree_path):
+            with open(toctree_path, 'r', encoding='utf-8') as f:
+                existing_data = yaml.safe_load(f) or []
+        else:
+            existing_data = []
+        for entry in new_entries:
+            if entry not in existing_data:
+                existing_data.append(entry)
+        with open(toctree_path, 'w', encoding='utf-8') as f:
+            yaml.dump(existing_data, f, allow_unicode=True, default_flow_style=False, sort_keys=False)
+    def create_updated_toctree_with_llm(self, en_toctree_yaml: str, ko_toctree_yaml: str, target_local: str) -> dict:
+        """Use LLM to create updated Korean toctree with new entry at correct position"""
+        try:
+            from translator.content import llm_translate
+            prompt = f"""You are given English and Korean toctree YAML structures. You need to:
+1. Find the entry(local, title) with `- local: {target_local}` in the English toctree
+2. Translate its title to Korean
+3. Insert this new entry into the Korean toctree at the same position as it appears in the English toctree
+4. Return the complete updated Korean toctree
+English toctree YAML:
+```yaml
+{en_toctree_yaml}
+```
+Current Korean toctree YAML:
+```yaml
+{ko_toctree_yaml}
+```
+Target local path to add: "{target_local}"
+Return the complete updated Korean toctree in YAML format:
+```yaml
+# Updated Korean toctree with new entry inserted at correct position
+[complete toctree structure here]
+```
+Important positioning rules:
+- Find the exact position (index and nesting level) of the target entry in the English toctree
+- Count from the beginning: if it's the 5th item in English toctree, it should be the 5th item in Korean toctree
+- If it's inside a 'sections' array, maintain that nesting structure
+- Keep all existing Korean entries in their current positions
+- Insert the new Korean entry at the exact same position as the English entry
+- If there are gaps in positions (missing entries), maintain those gaps
+- Preserve the exact YAML structure: {{local: "path", title: "title"}} or {{local: "path", title: "title", sections: [...]}}
+Example: If English entry is at position [2] (3rd item), insert Korean entry at position [2] in Korean toctree
+Example: If English entry is at position [1]['sections'][0] (1st item in sections of 2nd entry), insert at same nested position"""
+            callback_result, response = llm_translate(prompt)
+            # Parse YAML response
+            response = response.strip()
+            try:
+                # Extract YAML content between ```yaml and ```
+                if "```yaml" in response:
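+                    yaml_start = response.find("```yaml") + 7
+                    yaml_end = response.find("```", yaml_start)
+                    # If the closing fence is missing, keep everything after the opening fence
+                    if yaml_end == -1:
+                        yaml_end = len(response)
+                    yaml_content = response[yaml_start:yaml_end].strip()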
+                else:
+                    yaml_content = response
+                updated_ko_toctree = yaml.safe_load(yaml_content)
+                return updated_ko_toctree
+            except Exception as e:
+                print(f"Failed to parse LLM YAML response: {e}")
+                print(f"Response was: {response}")
+                return None
+        except Exception as e:
+            print(f"Error using LLM to create updated toctree: {e}")
+            return None
+    def process_pr_commit(self, en_titles: List[str], local_paths: List[str], filepath: str):
+        """Process PR commit by using LLM to create complete updated Korean toctree"""
+        # Get filepath without prefix
+        filepath_without_prefix = filepath.replace("docs/source/en/", "").replace(".md", "")
+        # Get English and Korean toctrees as YAML strings
+        en_toctree = self.get_en_toctree()
+        ko_toctree = self.get_ko_toctree()
+        en_toctree_yaml = yaml.dump(en_toctree, allow_unicode=True, default_flow_style=False)
+        ko_toctree_yaml = yaml.dump(ko_toctree, allow_unicode=True, default_flow_style=False)
+        # Use LLM to create updated Korean toctree
+        updated_ko_toctree = self.create_updated_toctree_with_llm(en_toctree_yaml, ko_toctree_yaml, filepath_without_prefix)
+        if not updated_ko_toctree:
+            print(f"Failed to create updated Korean toctree for local: {filepath_without_prefix}")
+            return []
+        print("LLM successfully updated Korean toctree")
+        # Store the updated toctree for commit
+        self.updated_ko_toctree = updated_ko_toctree
+        print(f"Updated Korean toctree has {len(updated_ko_toctree)} items")
+        return []
+    def commit_and_push_toctree(self, pr_agent, owner: str, repo_name: str, branch_name: str):
+        """Commit and push toctree updates as a separate commit"""
+        try:
+            # Use the updated toctree created by LLM
+            if not hasattr(self, 'updated_ko_toctree') or not self.updated_ko_toctree:
+                print("No updated Korean toctree available")
+                return {"status": "error", "message": "No updated toctree to commit"}
+            ko_data = self.updated_ko_toctree
+            # Convert to YAML string
+            toctree_content = yaml.dump(ko_data, allow_unicode=True, default_flow_style=False, sort_keys=False)
+            # Create toctree commit message
+            commit_message = "docs: update Korean documentation table of contents - test"
+            # Commit toctree file
+            file_result = pr_agent.create_or_update_file(
+                owner=owner,
+                repo_name=repo_name,
+                path="docs/source/ko/_toctree.yml",
+                message=commit_message,
+                content=toctree_content,
+                branch_name=branch_name
+            )
+            if file_result.startswith("SUCCESS"):
+                return {
+                    "status": "success",
+                    "message": f"Toctree committed successfully: {file_result}",
+                    "commit_message": commit_message
+                }
+            else:
+                return {
+                    "status": "error",
+                    "message": f"Toctree commit failed: {file_result}"
+                }
+        except Exception as e:
+            return {
+                "status": "error",
+                "message": f"Error committing toctree: {str(e)}"
+            }
+    def update_toctree_after_translation(
+        self,
+        translation_result: dict,
+        en_title: str,
+        filepath: str,
+        pr_agent,
+        github_config: dict
+    ) -> dict:
+        """Update toctree after successful translation PR.
+        Args:
+            translation_result: Result from translation PR workflow
+            en_title: English title for toctree mapping
+            filepath: Original file path
+            pr_agent: GitHub PR agent instance
+            github_config: GitHub configuration dictionary
+        Returns:
+            Dictionary with toctree update result
+        """
+        if translation_result["status"] == "error" or not en_title:
+            return None
+        try:
+            local_path = filepath.split("/")[-1].replace(".md", "")
+            # Create new toctree entries
+            new_entries = self.process_pr_commit([en_title], [local_path], filepath)
+            # process_pr_commit leaves the attribute unset on failure, so read it defensively
+            print("updated_ko_toctree:", getattr(self, "updated_ko_toctree", None))
+            # Commit toctree as separate commit
+            return self.commit_and_push_toctree(
+                pr_agent=pr_agent,
+                owner=github_config["owner"],
+                repo_name=github_config["repo_name"],
+                branch_name=translation_result["branch"]
+            )
+            # return {
+            #     'status': 'success',
+            #     'message': 'Toctree committed successfully: SUCCESS: File updated - docs/source/ko/_toctree.yml',
+            #     'commit_message': 'docs: update Korean documentation table of contents'
+            #     }
+        except Exception as e:
+            return {
+                "status": "error",
+                "message": f"Error updating toctree: {str(e)}"
+            }
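
Review note on `extract_title_mappings`: the method pairs English and Korean entries by list index, so a single entry missing from the Korean toctree shifts every later pair in that list and the `local` comparison then rejects them. A minimal sketch of an alternative that matches on the `local` key instead is shown below. The helper names `iter_entries` and `title_mappings_by_local` are hypothetical and not part of this commit, and the sample data is a toy example rather than the real toctrees.

```python
from typing import Any, Dict, Iterator, List


def iter_entries(sections: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every toctree entry, descending into nested 'sections' lists."""
    for entry in sections:
        yield entry
        yield from iter_entries(entry.get("sections") or [])


def title_mappings_by_local(
    en_data: List[Dict[str, Any]], ko_data: List[Dict[str, Any]]
) -> Dict[str, str]:
    """Map English titles to Korean titles for entries that share a 'local' path."""
    # Index the Korean titles by their 'local' path, ignoring position entirely
    ko_titles = {
        entry["local"]: entry["title"]
        for entry in iter_entries(ko_data)
        if entry.get("local") and entry.get("title")
    }
    return {
        entry["title"]: ko_titles[entry["local"]]
        for entry in iter_entries(en_data)
        if entry.get("title") and entry.get("local") in ko_titles
    }


# Toy data: the Korean side lacks "installation", so index-based pairing would
# compare "installation" with "quicktour" and never reach the real "quicktour" pair.
en = [{"title": "Get started", "sections": [
    {"local": "index", "title": "Transformers"},
    {"local": "installation", "title": "Installation"},
    {"local": "quicktour", "title": "Quickstart"},
]}]
ko = [{"title": "시작하기", "sections": [
    {"local": "index", "title": "🤗 Transformers"},
    {"local": "quicktour", "title": "둘러보기"},
]}]

mappings = title_mappings_by_local(en, ko)
print(mappings)
```

Section headers without a `local` key are skipped here, which mirrors the committed method: it also records a title pair only when both sides carry the same `local` value.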

translator/prompt_glossary.py ADDED

	@@ -0,0 +1,126 @@

+PROMPT_WITH_GLOSSARY = """
+You have a glossary of terms with their Korean translations. When translating a sentence, you need to check if any of the words in the sentence are in the glossary, and if so, translate them according to the provided Korean terms. Here is the glossary:
+🔹 Glossary (English → Korean):
+- revision: 개정
+- method: 메소드
+- secrets: 비밀값
+- search helper: 검색 헬퍼
+- logging level: 로그 레벨
+- workflow: 워크플로우
+- corner case: 코너 케이스
+- tokenization: 토큰화
+- architecture: 아키텍처
+- attention mask: 어텐션 마스크
+- backbone: 백본
+- argmax: argmax
+- beam search: 빔 서치
+- clustering: 군집화
+- configuration: 구성
+- context: 문맥
+- cross entropy: 교차 엔트로피
+- cross-attention: 크로스 어텐션
+- dictionary: 딕셔너리
+- entry: 엔트리
+- few shot: 퓨샷
+- flatten: 평탄화
+- ground truth: 정답
+- head: 헤드
+- helper function: 헬퍼 함수
+- image captioning: 이미지 캡셔닝
+- image patch: 이미지 패치
+- inference: 추론
+- instance: 인스턴스
+- Instantiate: 인스턴스화
+- knowledge distillation: 지식 증류
+- labels: 레이블
+- large language models (LLM): 대규모 언어 모델
+- layer: 레이어
+- learning rate scheduler: Learning Rate Scheduler
+- localization: 로컬리제이션
+- log mel-filter bank: 로그 멜 필터 뱅크
+- look-up table: 룩업 테이블
+- loss function: 손실 함수
+- machine learning: 머신 러닝
+- mapping: 매핑
+- masked language modeling (MLM): 마스크드 언어 모델
+- malware: 악성코드
+- metric: 지표
+- mixed precision: 혼합 정밀도
+- modality: 모달리티
+- monolingual model: 단일 언어 모델
+- multi gpu: 다중 GPU
+- multilingual model: 다국어 모델
+- parsing: 파싱
+- perplexity (PPL): 펄플렉서티(Perplexity)
+- pipeline: 파이프라인
+- pixel values: 픽셀 값
+- pooling: 풀링
+- position IDs: 위치 ID
+- preprocessing: 전처리
+- prompt: 프롬프트
+- pythonic: 파이써닉
+- query: 쿼리
+- question answering: 질의 응답
+- raw audio waveform: 원시 오디오 파형
+- recurrent neural network (RNN): 순환 신경망
+- accelerator: 가속기
+- Accelerate: Accelerate
+- architecture: 아키텍처
+- arguments: 인수
+- attention mask: 어텐션 마스크
+- augmentation: 증강
+- autoencoding models: 오토인코딩 모델
+- autoregressive models: 자기회귀 모델
+- backward: 역방향
+- bounding box: 바운딩 박스
+- causal language modeling: 인과적 언어 모델링(causal language modeling)
+- channel: 채널
+- checkpoint: 체크포인트(checkpoint)
+- chunk: 묶음
+- computer vision: 컴퓨터 비전
+- convolution: 합성곱
+- crop: 자르기
+- custom: 사용자 정의
+- customize: 맞춤 설정하다
+- data collator: 데이터 콜레이터
+- dataset: 데이터 세트
+- decoder input IDs: 디코더 입력 ID
+- decoder models: 디코더 모델
+- deep learning (DL): 딥러닝
+- directory: 디렉터리
+- distributed training: 분산 학습
+- downstream: 다운스트림
+- encoder models: 인코더 모델
+- entity: 개체
+- epoch: 에폭
+- evaluation method: 평가 방법
+- feature extraction: 특성 추출
+- feature matrix: 특성 행렬(feature matrix)
+- fine-tuning: 미세 조정
+- finetuned models: 미세 조정 모델
+- hidden state: 은닉 상태
+- hyperparameter: 하이퍼파라미터
+- learning: 학습
+- load: 가져오다
+- method: 메소드
+- optimizer: 옵티마이저
+- pad (padding): 패드 (패딩)
+- parameter: 매개변수
+- pretrained model: 사전훈련된 모델
+- separator (* the name used for [SEP]): 분할 토큰
+- sequence: 시퀀스
+- silent error: 조용한 오류
+- token: 토큰
+- tokenizer: 토크나이저
+- training: 훈련
+- workflow: 워크플로우
+📌 Instructions:
+1. Whenever a source term from the glossary appears **in any form** (full match or partial match within a larger phrase), **replace it with the exact Korean translation** from the glossary, keeping the rest of the phrase in Korean.
+   - Example: “Attention Interface” → “어텐션 인터페이스”
+   - Example: “Architecture details” → “아키텍처 상세”
+2. Non-glossary words should be translated naturally, respecting context and technical nuance.
+Please revise the translated sentences accordingly using the terms provided in this glossary.
+"""
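
Review note on the glossary: several terms (`method`, `workflow`, `architecture`, `attention mask`) appear twice in the hard-coded list. One possible way to keep the list free of repeats is to render the glossary lines from term pairs and keep only the first translation seen for each term. The sketch below assumes a hypothetical helper `build_glossary_block` that is not part of this commit, and uses only a handful of the terms above as sample input.

```python
from typing import Dict, Iterable, Tuple


def build_glossary_block(pairs: Iterable[Tuple[str, str]]) -> str:
    """Render '- english: korean' lines, keeping the first translation per term."""
    seen: Dict[str, str] = {}
    for english, korean in pairs:
        key = english.strip()
        # Exact-match keys: 'Accelerate' and 'accelerator' stay distinct entries
        if key not in seen:
            seen[key] = f"- {key}: {korean.strip()}"
    return "\n".join(seen.values())


pairs = [
    ("method", "메소드"),
    ("workflow", "워크플로우"),
    ("architecture", "아키텍처"),
    ("method", "메소드"),        # repeated entry, dropped
    ("workflow", "워크플로우"),  # repeated entry, dropped
]
glossary_block = build_glossary_block(pairs)
print(glossary_block)
```

The rendered block could then be interpolated into the prompt text in place of the hand-maintained list; whether that suits the translator module depends on how `PROMPT_WITH_GLOSSARY` is consumed, which this commit does not show.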