Spaces: bor/aozora-bunko-preprocessor


Bor Hodošček committed on Jun 15

Commit 15986f6 (unverified) · 1 Parent(s): 8cef38d

chore: documentation fixes


Files changed (1)

app.py +36 -26


@@ -26,35 +26,44 @@ app = marimo.App()
 def _(mo):
     mo.md(
         rf"""
-    # Aozora Bunko Text Processing Pipeline Demo
     ### Summary
     1. Upload a text file from Aozora Bunko (or use the default sample).
     2. Preprocess using customizable regex patterns.
     3. Preview the first and last 50 lines of the cleaned text.
     4. Download the cleaned text.
-    5. Process the XHTML version with a Python library.
     6. Compare against the regex variant.
-    6. Define token matching patterns.
     7. Visualize token matches.
-    8. Define dependency matching patterns.
     9. Visualize dependency matches.
     ### 概要
     1. 青空文庫のテキストファイルをアップロードする（またはデフォルトサンプルを利用する）。
     2. 編集可能な正規表現で前処理する。
-    3. 前処理済みテキストの先頭50行と末尾50行をプレビューする。
     4. 前処理済みテキストをダウンロードする。
-    5. XHTML版をPythonのパッケージで処理する。
     6. 正規表現処理版と比較する。
-    7. トークンマッチング用パターンを定義する。
     8. トークンマッチ結果を可視化する。
-    9. 係り受け（依存）関係マッチング用パターンを定義する。
     10. 係り受け関係マッチ結果を可視化する。
-    {mo.callout("By default, this demo uses Natsume Soseki's _‘Wagahai wa neko de aru’_")}
     """
     )
     return
@@ -76,10 +85,6 @@ def _():
 @app.cell
 def upload_aozora_text(mo):
-    """
-    UI element to upload an Aozora‐Bunko text file.
-    Falls back to local file if none is provided.
-    """
     aozora_file = mo.ui.file(label="Upload Aozora-Bunko text (.txt)", multiple=False)
     return (aozora_file,)
@@ -92,7 +97,7 @@ def select_encoding(mo):
     encoding = mo.ui.dropdown(
         options=["shift-jis", "utf-8"],
         value="shift-jis",
-        label="Text file encoding",
         full_width=False,
     )
     return (encoding,)
@@ -145,22 +150,22 @@ def show_raw_head(mo, text_raw):
 def regex_inputs(mo):
     ruby_pattern = mo.ui.text(
         value=r"《[^》]+》",
-        label="Ruby‐annotation regex",
         full_width=True,
     )
     ruby_bar_pattern = mo.ui.text(
         value=r"｜",
-        label="Ruby‐bar regex",
         full_width=True,
     )
     annotation_pattern = mo.ui.text(
         value=r"［＃[^］]+?］",
-        label="Inline‐annotation regex",
         full_width=True,
     )
     hajime_pattern = mo.ui.text(
         value=r"-{55}(.|\n)+?-{55}",
-        label="Start‐marker regex",
         full_width=True,
     )
     owari_pattern = mo.ui.text(
@@ -168,7 +173,7 @@ def regex_inputs(mo):
             r"^[　【]?(底本：|訳者あとがき|この翻訳は|この作品.*翻訳|"
             r"この翻訳.*全訳)"
         ),
-        label="End‐marker regex",
         full_width=True,
     )
@@ -217,7 +222,7 @@ def clean_aozora(
     def clean_text(text: str) -> tuple[str, str, str]:
         """青空文庫テキスト形式の文字列textを入力とし，改行方式の統一，ルビーと各種のアノーテーションの削除，
-        青空文庫特有の"""
         title, author, text = (text.split("\n", 2) + ["", ""])[:3]
@@ -269,7 +274,7 @@ def download_cleaned_text(author, cleaned_text, mo, title):
         mimetype="text/plain",
     )
     mo.md(f"""
-    前処理済みファイルのダウンロード：
     {download_link}
     """)
     return
@@ -341,7 +346,7 @@ def _(aozora_xhtml_processed_text, author, mo, title):
         mimetype="text/plain",
     )
     mo.md(f"""
-    HTML版の前処理済みファイルをダウンロード：
     {xhtml_download_link}
     """)
     return
@@ -437,6 +442,7 @@ def toggle_diff(mo):
 @app.cell
 def compare_preprocessed_vs_old(
     aozora_xhtml_processed_text,
     cleaned_text,
     diff_changes,
@@ -453,10 +459,14 @@ def compare_preprocessed_vs_old(
         diff_result = diff_changes(
             cleaned_text, aozora_xhtml_processed_text, auto_display=False
         )
-    # else:
-    #     diff_result = mo.md("Diff comparison is turned off.")
-    diff_result
     return

 def _(mo):
     mo.md(
         rf"""
+    # Aozora Bunko Text Processing Pipeline Demo / 青空文庫テキストの前処理パイプラインデモ
     ### Summary
+    This notebook allows you to upload, preprocess, compare, visualize and analyze Aozora Bunko texts.
     1. Upload a text file from Aozora Bunko (or use the default sample).
     2. Preprocess using customizable regex patterns.
     3. Preview the first and last 50 lines of the cleaned text.
     4. Download the cleaned text.
+    5. Process the XHTML version with the `aozora-corpus-generator` Python library for comparison.
     6. Compare against the regex variant.
+    6. Define token matching patterns (not possible in App mode).
     7. Visualize token matches.
+    8. Define dependency matching patterns (not possible in App mode).
     9. Visualize dependency matches.
     ### 概要
+    このノートブックでは以下の手順で青空文庫テキストを読み込み、前処理、解析、可視化を行います。
     1. 青空文庫のテキストファイルをアップロードする（またはデフォルトサンプルを利用する）。
     2. 編集可能な正規表現で前処理する。
+    3. 前処理済みテキストの先頭50行と末尾50行をプレビューし、前処理が正常に本文以外のテキストを除外したか確認する。
     4. 前処理済みテキストをダウンロードする。
+    5. 比較のため、XHTML版をPythonのパッケージで処理する。
     6. 正規表現処理版と比較する。
+    7. トークンマッチング用パターンを定義する（アプリの場合は編集不可）。
     8. トークンマッチ結果を可視化する。
+    9. 係り受け（依存）関係マッチング用パターンを定義する（アプリの場合は編集不可）。
     10. 係り受け関係マッチ結果を可視化する。
+    {
+            mo.callout('''
+            -  By default, this demo uses Natsume Soseki's _‘Wagahai wa neko de aru’_
+            -   ファイルをアップロードしない場合は、デフォルトで夏目漱石『吾輩は猫である』が使用されます。
+            ''')
+        }
     """
     )
     return
 @app.cell
 def upload_aozora_text(mo):
     aozora_file = mo.ui.file(label="Upload Aozora-Bunko text (.txt)", multiple=False)
     return (aozora_file,)
     encoding = mo.ui.dropdown(
         options=["shift-jis", "utf-8"],
         value="shift-jis",
+        label="Text file encoding / 文字コード",
         full_width=False,
     )
     return (encoding,)
 def regex_inputs(mo):
     ruby_pattern = mo.ui.text(
         value=r"《[^》]+》",
+        label="ルビ",
         full_width=True,
     )
     ruby_bar_pattern = mo.ui.text(
         value=r"｜",
+        label="ルビのかかる範囲を示す記号",
         full_width=True,
     )
     annotation_pattern = mo.ui.text(
         value=r"［＃[^］]+?］",
+        label="注釈・アノテーション",
         full_width=True,
     )
     hajime_pattern = mo.ui.text(
         value=r"-{55}(.|\n)+?-{55}",
+        label="青空文庫のヘッダー",
         full_width=True,
     )
     owari_pattern = mo.ui.text(
             r"^[　【]?(底本：|訳者あとがき|この翻訳は|この作品.*翻訳|"
             r"この翻訳.*全訳)"
         ),
+        label="青空文庫のフッター",
         full_width=True,
     )
     def clean_text(text: str) -> tuple[str, str, str]:
         """青空文庫テキスト形式の文字列textを入力とし，改行方式の統一，ルビーと各種のアノーテーションの削除，
+        青空文庫特有のヘッダーとフッターを取り除く処理を行う。"""
         title, author, text = (text.split("\n", 2) + ["", ""])[:3]
         mimetype="text/plain",
     )
     mo.md(f"""
+    前処理済みファイルのダウンロード (UTF-8)：
     {download_link}
     """)
     return
         mimetype="text/plain",
     )
     mo.md(f"""
+    HTML版の前処理済みファイルをダウンロード (UTF-8)：
     {xhtml_download_link}
     """)
     return
 @app.cell
 def compare_preprocessed_vs_old(
+    mo,
     aozora_xhtml_processed_text,
     cleaned_text,
     diff_changes,
         diff_result = diff_changes(
             cleaned_text, aozora_xhtml_processed_text, auto_display=False
         )
+    mo.md(f"""
+    -   赤: 正規表現版のみにある文字列
+    -   青: HTML版のみにある文字列
+    {diff_result}
+    """)
     return
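
For reference, the default patterns relabelled in `regex_inputs` can be tried on a small hand-made sample. The following is a minimal sketch, not the body of `clean_text` from `app.py`: the function name `clean_sketch`, the sample text, the order of the steps and the use of `re.MULTILINE` for the footer pattern are all assumptions made for illustration. Only the five regex strings and the title/author split are copied from the diff above.

```python
import re

# Default patterns as they appear in regex_inputs (copied from the diff).
RUBY = r"《[^》]+》"                 # ruby readings
RUBY_BAR = r"｜"                    # marks where a ruby span starts
ANNOTATION = r"［＃[^］]+?］"        # inline editorial annotations
HEADER = r"-{55}(.|\n)+?-{55}"      # Aozora Bunko header block
FOOTER = (
    r"^[　【]?(底本：|訳者あとがき|この翻訳は|この作品.*翻訳|"
    r"この翻訳.*全訳)"
)


def clean_sketch(raw: str) -> tuple[str, str, str]:
    """Hypothetical helper: returns (title, author, body) for one text."""
    # Unify line endings before any line-based matching.
    raw = raw.replace("\r\n", "\n")
    # Same split as in the diff: first line is the title, second the author.
    title, author, text = (raw.split("\n", 2) + ["", ""])[:3]
    # Drop the first header block (assumed: at most one per file).
    text = re.sub(HEADER, "", text, count=1)
    # Cut everything from the first footer line onwards (assumed behaviour).
    footer = re.search(FOOTER, text, flags=re.MULTILINE)
    if footer:
        text = text[: footer.start()]
    # Remove ruby readings, ruby bars and inline annotations.
    for pattern in (RUBY, RUBY_BAR, ANNOTATION):
        text = re.sub(pattern, "", text)
    return title, author, text.strip()


# Hand-made sample in the Aozora Bunko plain-text layout (not a real file).
sample = "\r\n".join(
    [
        "吾輩は猫である",
        "夏目漱石",
        "",
        "-" * 55,
        "【テキスト中に現れる記号について】",
        "-" * 55,
        "",
        "［＃８字下げ］一［＃「一」は中見出し］",
        "　吾輩《わがはい》は猫である。名前はまだ無い。",
        "一番｜獰悪《どうあく》な種族であったそうだ。",
        "",
        "底本：「夏目漱石全集1」ちくま文庫、筑摩書房",
    ]
)

title, author, text = clean_sketch(sample)
print(title, author)
print(text)
```

Previewing the first and last lines of the result, as the summary's step 3 suggests, is a quick way to check that the header and footer patterns removed only non-body text.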