Metadata-Version: 2.1
Name: charset-normalizer
Version: 2.1.1
Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
Home-page: https://github.com/ousret/charset_normalizer
Author: Ahmed TAHRI @Ousret
Author-email: ahmed.tahri@cloudnursery.dev
License: MIT
Project-URL: Bug Reports, https://github.com/Ousret/charset_normalizer/issues
Project-URL: Documentation, https://charset-normalizer.readthedocs.io/en/latest
Keywords: encoding,i18n,txt,text,charset,charset-detector,normalization,unicode,chardet
Classifier: Development Status :: 5 - Production/Stable
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Utilities
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Typing :: Typed
Requires-Python: >=3.6.0
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: unicode_backport
Requires-Dist: unicodedata2 ; extra == 'unicode_backport'

<h1 align="center">Charset Detection, for Everyone 👋 <a href="https://twitter.com/intent/tweet?text=The%20Real%20First%20Universal%20Charset%20%26%20Language%20Detector&url=https://www.github.com/Ousret/charset_normalizer&hashtags=python,encoding,chardet,developers"><img src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social"/></a></h1>

<p align="center">
  <sup>The Real First Universal Charset Detector</sup><br>
  <a href="https://pypi.org/project/charset-normalizer">
    <img src="https://img.shields.io/pypi/pyversions/charset_normalizer.svg?orange=blue" />
  </a>
  <a href="https://codecov.io/gh/Ousret/charset_normalizer">
    <img src="https://codecov.io/gh/Ousret/charset_normalizer/branch/master/graph/badge.svg" />
  </a>
  <a href="https://pepy.tech/project/charset-normalizer/">
    <img alt="Download Count Total" src="https://pepy.tech/badge/charset-normalizer/month" />
  </a>
</p>

> A library that helps you read text from an unknown charset encoding.<br /> Motivated by `chardet`,
> I'm trying to resolve the issue by taking a new approach.
> All IANA character set names for which the Python core library provides codecs are supported.

<p align="center">
  >>>>> <a href="https://charsetnormalizerweb.ousret.now.sh" target="_blank">👉 Try Me Online Now, Then Adopt Me 👈 </a> <<<<<
</p>

This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.

| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
| ------------- | :-------------: | :------------------: | :------------------: |
| `Fast` | ❌ | ✅ | ✅ |
| `Universal**` | ❌ | ✅ | ❌ |
| `Reliable` **without** distinguishable standards | ❌ | ✅ | ✅ |
| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ |
| `License` | LGPL-2.1<br>_restrictive_ | MIT | MPL-1.1<br>_restrictive_ |
| `Native Python` | ✅ | ✅ | ❌ |
| `Detect spoken language` | ❌ | ✅ | N/A |
| `UnicodeDecodeError Safety` | ❌ | ✅ | ❌ |
| `Whl Size` | 193.6 kB | 39.5 kB | ~200 kB |
| `Supported Encoding` | 33 | :tada: [93](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40 |

<p align="center">
<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
</p>

*\*\* : They clearly rely on code written for specific encodings, even if they cover most of the commonly used ones.*<br>
Did you get here because of the logs? See [https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html](https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html)

## ⭐ Your support

*Fork it, test it, star it, submit your ideas! We do listen.*

## ⚡ Performance

This package offers better performance than its counterpart, Chardet. Here are some numbers.

| Package | Accuracy | Mean per file (ms) | Files per sec (est.) |
| ------------- | :-------------: | :------------------: | :------------------: |
| [chardet](https://github.com/chardet/chardet) | 86 % | 200 ms | 5 files/sec |
| charset-normalizer | **98 %** | **39 ms** | 26 files/sec |

| Package | 99th percentile | 95th percentile | 50th percentile |
| ------------- | :-------------: | :------------------: | :------------------: |
| [chardet](https://github.com/chardet/chardet) | 1200 ms | 287 ms | 23 ms |
| charset-normalizer | 400 ms | 200 ms | 15 ms |

Chardet's performance on larger files (1 MB+) is very poor. Expect a huge difference on large payloads.

> Stats are generated from 400+ files with default parameters. For more details on the files used, see the GHA workflows.
> And yes, these results might change at any time. The dataset can be updated to include more files.
> The actual delays depend heavily on your CPU capabilities. The factors should remain the same.
> Keep in mind that the stats are generous and that Chardet's accuracy vs. ours is measured using Chardet's initial capability
> (e.g. Supported Encoding). Challenge them if you want.
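
The "files per second" estimate in the first table appears to be the reciprocal of the mean time per file. That relationship is an assumption inferred from the published numbers, not something the benchmark states. A quick check in Python:

```python
# Assumption: files/sec is estimated as 1000 ms divided by the mean time per file.
# The mean timings (in ms) are copied from the first benchmark table.
mean_ms = {"chardet": 200, "charset-normalizer": 39}

files_per_sec = {name: round(1000 / ms) for name, ms in mean_ms.items()}
speedup = mean_ms["chardet"] / mean_ms["charset-normalizer"]

print(files_per_sec)
print(f"mean speed-up factor: {speedup:.1f}x")
```

Under that assumption the rounded values match the 5 and 26 files/sec reported in the table.
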

[cchardet](https://github.com/PyYoshi/cChardet) is a non-native (C++ binding), unmaintained, and faster alternative with
better accuracy than chardet but lower than this package. If speed is the most important factor, you should try it.

## ✨ Installation

Using PyPI for the latest stable release:
```sh
pip install charset-normalizer -U
```

If you want a more up-to-date `unicodedata` than the one available in your Python setup:
```sh
pip install charset-normalizer[unicode_backport] -U
```

## 🚀 Basic Usage

### CLI
This package comes with a CLI.

```
usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
                  file [file ...]

The Real First Universal Charset Detector. Discover originating encoding used
on text file. Normalize text to unicode.

positional arguments:
  files                 File(s) to be analysed

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display complementary information about file if any.
                        Stdout will contain logs about the detection process.
  -a, --with-alternative
                        Output complementary possibilities if any. Top-level
                        JSON WILL be a list.
  -n, --normalize       Permit to normalize input file. If not set, program
                        does not write anything.
  -m, --minimal         Only output the charset detected to STDOUT. Disabling
                        JSON output.
  -r, --replace         Replace file when trying to normalize it instead of
                        creating a new one.
  -f, --force           Replace file without asking if you are sure, use this
                        flag with caution.
  -t THRESHOLD, --threshold THRESHOLD
                        Define a custom maximum amount of chaos allowed in
                        decoded content. 0. <= chaos <= 1.
  --version             Show version information and exit.
```

```bash
normalizer ./data/sample.1.fr.srt
```

:tada: Since version 1.4.0, the CLI produces easily usable stdout output in JSON format.

```json
{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}
```
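
Because the output is plain JSON, it is easy to consume from a script. The sketch below parses a trimmed copy of the report above. `summarize_report` is a hypothetical helper, not part of the package, and it assumes that with `-a` the top-level value is a list of such objects, as the help text indicates.

```python
import json
from typing import List, Tuple


def summarize_report(raw: str) -> List[Tuple[str, str, str]]:
    """Return (path, encoding, language) for each entry in the CLI JSON."""
    data = json.loads(raw)
    # Without -a the top level is a single object; with -a it is a list.
    entries = data if isinstance(data, list) else [data]
    return [(e["path"], e["encoding"], e["language"]) for e in entries]


# Trimmed copy of the sample report shown above.
sample = """
{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "language": "French",
    "chaos": 0.149,
    "coherence": 97.152
}
"""

for path, encoding, language in summarize_report(sample):
    print(f"{path}: {encoding} ({language})")
```

In practice the `raw` string would come from the captured stdout of the `normalizer` command rather than from an inline literal.
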

### Python
*Just print out normalized text*
```python
from charset_normalizer import from_path

results = from_path('./my_subtitle.srt')

# best() returns the most probable match, or None if nothing fits
print(str(results.best()))
```

*Normalize any text file*
```python
from charset_normalizer import normalize

try:
    normalize('./my_subtitle.srt')  # should write to disk my_subtitle-***.srt
except IOError as e:
    print('Sadly, we are unable to perform charset normalization.', str(e))
```

*Upgrade your code without effort*
```python
from charset_normalizer import detect
```

The above code will behave the same as **chardet**. We ensure that we offer the best (reasonable) backward-compatible result possible.

See the docs for advanced usage: [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/)
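
As a rough illustration of the drop-in `detect` usage, the sketch below decodes a payload from the dictionary that `detect` returns (the same `encoding`, `language` and `confidence` keys as chardet). `decode_with_detect` and its `fallback` parameter are hypothetical names for this example. The detector is injectable so that the snippet also runs where the package is not installed.

```python
from typing import Callable, Optional


def decode_with_detect(
    payload: bytes,
    detector: Optional[Callable[[bytes], dict]] = None,
    fallback: str = "utf_8",
) -> str:
    """Decode payload using a chardet-style detector result."""
    if detector is None:
        # Lazy import: only needed when no detector is injected.
        from charset_normalizer import detect as detector
    guess = detector(payload)
    # 'encoding' may be None when nothing plausible was found.
    encoding = guess.get("encoding") or fallback
    return payload.decode(encoding, errors="replace")


# Stand-in detector so the example is self-contained (hypothetical values).
def fake_detect(_: bytes) -> dict:
    return {"encoding": "cp1252", "language": "French", "confidence": 1.0}


print(decode_with_detect("café".encode("cp1252"), detector=fake_detect))
```

Calling `decode_with_detect(payload)` without a `detector` argument uses the real `charset_normalizer.detect`.
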

## 😇 Why

When I started using Chardet, I noticed that it did not meet my expectations, and I wanted to propose a
reliable alternative using a completely different method. Also! I never back down on a good challenge!

I **don't care** about the **originating charset** encoding, because **two different tables** can
produce **two identical rendered strings.**
What I want is to get readable text, the best I can.

In a way, **I'm brute forcing text decoding.** How cool is that? 😎

Don't confuse the **ftfy** package with charset-normalizer or chardet. ftfy's goal is to repair Unicode strings, whereas charset-normalizer converts a raw file in an unknown encoding to Unicode.

## 🍰 How

  - Discard all charset encoding tables that could not fit the binary content.
  - Measure chaos, or the mess once opened (by chunks) with a corresponding charset encoding.
  - Extract the matches with the lowest mess detected.
  - Additionally, we measure coherence / probe for a language.

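
The first step can be approximated with nothing but the standard library. This is only a sketch of the idea, not the package's implementation: `plausible_encodings` and the short candidate list are assumptions made for the example.

```python
from typing import Iterable, List


def plausible_encodings(payload: bytes, candidates: Iterable[str]) -> List[str]:
    """Keep only the codecs that can decode the payload without error."""
    kept = []
    for name in candidates:
        try:
            payload.decode(name)
        except (UnicodeDecodeError, LookupError):
            # The table cannot fit these bytes (or the codec is unknown).
            continue
        kept.append(name)
    return kept


payload = "déjà vu".encode("cp1252")
survivors = plausible_encodings(payload, ["ascii", "utf_8", "cp1252", "latin_1"])
print(survivors)

# Two different tables can render the very same string.
print(payload.decode("cp1252") == payload.decode("latin_1"))
```

Here `ascii` and `utf_8` are eliminated because the byte `0xE9` is not valid in either, while the two surviving tables decode to identical text. That is why the remaining steps rank candidates by mess and coherence instead of looking for one "true" encoding.
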
**Wait a minute**, what is chaos/mess and coherence according to **YOU?**

*Chaos:* I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then
**I established** some ground rules about **what is obvious** when **it seems like** a mess.
I know that my interpretation of what is chaotic is very subjective; feel free to contribute in order to
improve or rewrite it.
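
To make the idea concrete, here is a deliberately naive mess ratio. It is not the rule set used by this package: the choice of "suspicious" characters (control, unassigned, private-use and symbol categories) is an assumption for illustration, and `toy_chaos` is a hypothetical name.

```python
import unicodedata


def toy_chaos(text: str) -> float:
    """Fraction of characters that look out of place in human-written text."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary whitespace controls are fine
        category = unicodedata.category(ch)
        # Cc: control, Cn: unassigned, Co: private use, S*: symbols
        if category in ("Cc", "Cn", "Co") or category.startswith("S"):
            suspicious += 1
    return suspicious / len(text)


raw = "déjà vu, c'est écrit en français".encode("utf_8")
right = raw.decode("utf_8")
wrong = raw.decode("cp1252")  # wrong table: 'é' turns into the pair 'Ã©'

print(toy_chaos(right), toy_chaos(wrong))
```

The wrongly decoded string scores higher because each `é` gains a stray `©` symbol. A real detector needs many more rules than this single category check.
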

*Coherence:* For each language on earth, we have computed ranked letter appearance occurrences (the best we can). So I thought
that intel was worth something here. So I use those records against decoded text to check if I can detect intelligent design.
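
A toy version of that comparison might look like the following. The reference string is a rough stand-in (the commonly cited "etaoin shrdlu" ordering for English), not the frequency data shipped with the project, and `toy_coherence` is a hypothetical name.

```python
from collections import Counter

# Rough stand-in for a per-language frequency table (assumption: the commonly
# cited "etaoin shrdlu" ordering for English, not the project's actual data).
ENGLISH_TOP = "etaoinshrdlu"


def toy_coherence(text: str, reference: str = ENGLISH_TOP, top_n: int = 8) -> float:
    """Share of the text's most frequent letters that also appear in the reference."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    most_common = [ch for ch, _ in Counter(letters).most_common(top_n)]
    return len(set(most_common) & set(reference)) / len(most_common)


english = "the quick brown fox jumps over the lazy dog and then it rests in the shade"
gibberish = "zxq jvk wpf zzq xxj kkv qqz"

print(toy_coherence(english), toy_coherence(gibberish))
```

Text whose most frequent letters overlap with the reference ranking scores higher. Comparing such scores across languages is one simple way to probe which language, if any, a decoded candidate resembles.
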

## ⚡ Known limitations

  - Language detection is unreliable when text contains two or more languages sharing identical letters (e.g. HTML with English tags plus Turkish content, both using Latin characters).
  - Every charset detector heavily depends on sufficient content. In common cases, do not bother running detection on very tiny content.

## 👤 Contributing

Contributions, issues and feature requests are very much welcome.<br />
Feel free to check the [issues page](https://github.com/ousret/charset_normalizer/issues) if you want to contribute.

## 📝 License

Copyright © 2019 [Ahmed TAHRI @Ousret](https://github.com/Ousret).<br />
This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.

Character frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)