	Metadata-Version: 2.1
	Name: charset-normalizer
	Version: 2.1.1
	Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
	Home-page: https://github.com/ousret/charset_normalizer
	Author: Ahmed TAHRI @Ousret
	Author-email: ahmed.tahri@cloudnursery.dev
	License: MIT
	Project-URL: Bug Reports, https://github.com/Ousret/charset_normalizer/issues
	Project-URL: Documentation, https://charset-normalizer.readthedocs.io/en/latest
	Keywords: encoding,i18n,txt,text,charset,charset-detector,normalization,unicode,chardet
	Classifier: Development Status :: 5 - Production/Stable
	Classifier: License :: OSI Approved :: MIT License
	Classifier: Intended Audience :: Developers
	Classifier: Topic :: Software Development :: Libraries :: Python Modules
	Classifier: Operating System :: OS Independent
	Classifier: Programming Language :: Python
	Classifier: Programming Language :: Python :: 3
	Classifier: Programming Language :: Python :: 3.6
	Classifier: Programming Language :: Python :: 3.7
	Classifier: Programming Language :: Python :: 3.8
	Classifier: Programming Language :: Python :: 3.9
	Classifier: Programming Language :: Python :: 3.10
	Classifier: Programming Language :: Python :: 3.11
	Classifier: Topic :: Text Processing :: Linguistic
	Classifier: Topic :: Utilities
	Classifier: Programming Language :: Python :: Implementation :: PyPy
	Classifier: Typing :: Typed
	Requires-Python: >=3.6.0
	Description-Content-Type: text/markdown
	License-File: LICENSE
	Provides-Extra: unicode_backport
	Requires-Dist: unicodedata2 ; extra == 'unicode_backport'


	<h1 align="center">Charset Detection, for Everyone 👋 <a href="https://twitter.com/intent/tweet?text=The%20Real%20First%20Universal%20Charset%20%26%20Language%20Detector&url=https://www.github.com/Ousret/charset_normalizer&hashtags=python,encoding,chardet,developers"><img src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social"/></a></h1>

	<p align="center">
	<sup>The Real First Universal Charset Detector</sup><br>
	<a href="https://pypi.org/project/charset-normalizer">
	<img src="https://img.shields.io/pypi/pyversions/charset_normalizer.svg?orange=blue" />
	</a>
	<a href="https://codecov.io/gh/Ousret/charset_normalizer">
	<img src="https://codecov.io/gh/Ousret/charset_normalizer/branch/master/graph/badge.svg" />
	</a>
	<a href="https://pepy.tech/project/charset-normalizer/">
	<img alt="Download Count Total" src="https://pepy.tech/badge/charset-normalizer/month" />
	</a>
	</p>

	> A library that helps you read text from an unknown charset encoding.<br /> Motivated by `chardet`,
	> I'm trying to resolve the issue by taking a new approach.
	> All IANA character set names for which the Python core library provides codecs are supported.

	<p align="center">
	>>>>> <a href="https://charsetnormalizerweb.ousret.now.sh" target="_blank">👉 Try Me Online Now, Then Adopt Me 👈 </a> <<<<<
	</p>

	This project offers you an alternative to Universal Charset Encoding Detector, also known as Chardet.

	| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
	| ------------- | :-------------: | :------------------: | :------------------: |
	| `Fast` | ❌<br> | ✅<br> | ✅ <br> |
	| `Universal**` | ❌ | ✅ | ❌ |
	| `Reliable` without distinguishable standards | ❌ | ✅ | ✅ |
	| `Reliable` with distinguishable standards | ✅ | ✅ | ✅ |
	| `License` | LGPL-2.1<br>_restrictive_ | MIT | MPL-1.1<br>_restrictive_ |
	| `Native Python` | ✅ | ✅ | ❌ |
	| `Detect spoken language` | ❌ | ✅ | N/A |
	| `UnicodeDecodeError Safety` | ❌ | ✅ | ❌ |
	| `Whl Size` | 193.6 kB | 39.5 kB | ~200 kB |
	| `Supported Encoding` | 33 | :tada: [93](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40 |

	<p align="center">
	<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
	</p>

	*\*\* : They clearly use encoding-specific code, even if it covers most of the commonly used encodings.*<br>
	Did you get here because of the logs? See [https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html](https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html)

	## ⭐ Your support

	Fork, test-it, star-it, submit your ideas! We do listen.

	## ⚡ Performance

	This package offers better performance than its counterpart Chardet. Here are some numbers.

	| Package | Accuracy | Mean per file (ms) | File per sec (est) |
	| ------------- | :-------------: | :------------------: | :------------------: |
	| [chardet](https://github.com/chardet/chardet) | 86 % | 200 ms | 5 file/sec |
	| charset-normalizer | 98 % | 39 ms | 26 file/sec |

	| Package | 99th percentile | 95th percentile | 50th percentile |
	| ------------- | :-------------: | :------------------: | :------------------: |
	| [chardet](https://github.com/chardet/chardet) | 1200 ms | 287 ms | 23 ms |
	| charset-normalizer | 400 ms | 200 ms | 15 ms |

	Chardet's performance on larger files (1 MB+) is very poor. Expect a huge difference on large payloads.

	> Stats are generated from 400+ files using default parameters. For more details on the files used, see the GHA workflows.
	> And yes, these results might change at any time. The dataset can be updated to include more files.
	> The actual delays depend heavily on your CPU capabilities. The relative factors should remain the same.
	> Keep in mind that the stats are generous and that Chardet's accuracy versus ours is measured using Chardet's initial capability
	> (e.g. supported encodings). Challenge them if you want.

	[cchardet](https://github.com/PyYoshi/cChardet) is a non-native (C++ binding), unmaintained, faster alternative whose
	accuracy is better than chardet's but lower than this package's. If speed is the most important factor, you should try it.

	## ✨ Installation

	Install the latest stable release from PyPI:
	```sh
	pip install charset-normalizer -U
	```

	If you want a more up-to-date `unicodedata` than the one bundled with your Python installation:
	```sh
	pip install charset-normalizer[unicode_backport] -U
	```

	## 🚀 Basic Usage

	### CLI
	This package comes with a CLI.

	```
	usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
	                  file [file ...]

	The Real First Universal Charset Detector. Discover originating encoding used
	on text file. Normalize text to unicode.

	positional arguments:
	  files                 File(s) to be analysed

	optional arguments:
	  -h, --help            show this help message and exit
	  -v, --verbose         Display complementary information about file if any.
	                        Stdout will contain logs about the detection process.
	  -a, --with-alternative
	                        Output complementary possibilities if any. Top-level
	                        JSON WILL be a list.
	  -n, --normalize       Permit to normalize input file. If not set, program
	                        does not write anything.
	  -m, --minimal         Only output the charset detected to STDOUT. Disabling
	                        JSON output.
	  -r, --replace         Replace file when trying to normalize it instead of
	                        creating a new one.
	  -f, --force           Replace file without asking if you are sure, use this
	                        flag with caution.
	  -t THRESHOLD, --threshold THRESHOLD
	                        Define a custom maximum amount of chaos allowed in
	                        decoded content. 0. <= chaos <= 1.
	  --version             Show version information and exit.
	```

	```bash
	normalizer ./data/sample.1.fr.srt
	```

	:tada: Since version 1.4.0, the CLI produces an easily usable result on stdout in JSON format.

	```json
	{
	    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
	    "encoding": "cp1252",
	    "encoding_aliases": [
	        "1252",
	        "windows_1252"
	    ],
	    "alternative_encodings": [
	        "cp1254",
	        "cp1256",
	        "cp1258",
	        "iso8859_14",
	        "iso8859_15",
	        "iso8859_16",
	        "iso8859_3",
	        "iso8859_9",
	        "latin_1",
	        "mbcs"
	    ],
	    "language": "French",
	    "alphabets": [
	        "Basic Latin",
	        "Latin-1 Supplement"
	    ],
	    "has_sig_or_bom": false,
	    "chaos": 0.149,
	    "coherence": 97.152,
	    "unicode_path": null,
	    "is_preferred": true
	}
	```
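Because the report is plain JSON, it can be consumed from a script. The following is a minimal sketch, assuming the CLI output has already been captured into a string; the `raw_report` variable is hypothetical and holds an abridged copy of the sample above.

```python
import json

# Abridged copy of the sample report shown above. In practice this string
# would be the captured stdout of the `normalizer` command (an assumption
# of this sketch, not something the snippet does itself).
raw_report = """
{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "alternative_encodings": ["cp1254", "latin_1"],
    "language": "French",
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152
}
"""

report = json.loads(raw_report)

# With --with-alternative the top-level JSON is a list, so treat both
# shapes uniformly by wrapping a single report in a list.
entries = report if isinstance(report, list) else [report]

for entry in entries:
    print(entry["path"], "->", entry["encoding"], f"({entry['language']})")
```

Wrapping a single report in a list keeps the same loop usable whether or not `-a` was passed.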

	### Python
	Just print out normalized text
	```python
	from charset_normalizer import from_path

	results = from_path('./my_subtitle.srt')

	print(str(results.best()))
	```

	Normalize any text file
	```python
	from charset_normalizer import normalize
	try:
	    normalize('./my_subtitle.srt')  # should write my_subtitle-***.srt to disk
	except IOError as e:
	    print('Sadly, we are unable to perform charset normalization.', str(e))
	```

	Upgrade your code without effort
	```python
	from charset_normalizer import detect
	```

	The above code will behave the same as chardet. We ensure that we offer the best (reasonable) backward-compatible result possible.

	See the docs for advanced usage: [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/)

	## 😇 Why

	When I started using Chardet, I noticed that it did not meet my expectations, and I wanted to propose a
	reliable alternative using a completely different method. Also! I never back down from a good challenge!

	I don't care about the originating charset encoding, because two different tables can
	produce the same rendered string.
	What I want is to get readable text, the best I can.
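As a small illustration of that point, here is a sketch using only Python's built-in codecs (the sample bytes are made up for the example): the same payload decodes to the same string under two different tables, so either one gives readable text.

```python
# One payload, two different encoding tables.
payload = b"caf\xe9"

as_cp1252 = payload.decode("cp1252")
as_latin_1 = payload.decode("latin_1")

# Both tables map byte 0xE9 to the same character, so the rendered
# strings are identical even though the codecs differ.
print(as_cp1252, as_latin_1, as_cp1252 == as_latin_1)
```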

	In a way, I'm brute forcing text decoding. How cool is that? 😎

	Don't confuse the ftfy package with charset-normalizer or chardet. ftfy's goal is to repair broken Unicode strings, whereas charset-normalizer's goal is to convert a raw file in an unknown encoding to Unicode.

	## 🍰 How

	- Discard all charset encoding tables that could not fit the binary content.
	- Measure the chaos, or mess, once the content is opened (by chunks) with a candidate charset encoding.
	- Extract matches with the lowest mess detected.
	- Additionally, we measure coherence / probe for a language.
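The first step can be pictured with the standard library alone. This is only a sketch of the idea, not the package's actual implementation: the `plausible_encodings` helper and the short candidate list are hypothetical, and the real package considers far more codecs.

```python
# Hypothetical, deliberately short candidate list for the illustration.
CANDIDATES = ["ascii", "utf_8", "cp1252", "latin_1"]


def plausible_encodings(payload: bytes, candidates=CANDIDATES) -> list:
    """Keep only the codecs that can decode the payload without error."""
    kept = []
    for name in candidates:
        try:
            payload.decode(name)
        except UnicodeDecodeError:
            # This table cannot fit the binary content: discard it.
            continue
        kept.append(name)
    return kept


payload = "Le café est prêt".encode("cp1252")
print(plausible_encodings(payload))
```

Several tables usually survive this filter, which is why the later steps (mess and coherence) are needed to pick among them.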

	Wait a minute, what is chaos/mess and coherence according to YOU?

	Chaos: I opened hundreds of text files, written by humans, with the wrong encoding table. I observed, then
	I established some ground rules about what is obvious when it seems like a mess.
	I know that my interpretation of what is chaotic is very subjective, so feel free to contribute in order to
	improve or rewrite it.
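To make the notion of mess concrete, here is a toy measure. It is an assumption made purely for illustration and is much cruder than the package's own ground rules: the hypothetical `toy_mess_ratio` function simply counts symbols and control characters, which tend to appear when text is opened with the wrong table.

```python
import unicodedata


def toy_mess_ratio(text: str) -> float:
    """Share of characters that are symbols or control/unassigned code points.

    Toy heuristic for illustration only, not the package's real rules.
    """
    if not text:
        return 0.0
    suspicious = sum(
        1
        for char in text
        if not char.isspace() and unicodedata.category(char)[0] in ("C", "S")
    )
    return suspicious / len(text)


payload = "Le café est prêt".encode("utf_8")

right = payload.decode("utf_8")   # the table the text was written with
wrong = payload.decode("cp1252")  # a table that fits the bytes but garbles them

print(toy_mess_ratio(right), toy_mess_ratio(wrong))
```

Both decodings succeed, but the wrong table yields a higher ratio, so ranking candidates by mess favours the readable one.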

	Coherence: For each language on earth, we have computed letter frequency rankings (as best we can). I thought
	that intel was worth something here, so I use those records against the decoded text to check whether I can detect intelligent design.

	## ⚡ Known limitations

	- Language detection is unreliable when the text contains two or more languages sharing identical letters (e.g. HTML with English tags plus Turkish content, both using Latin characters).
	- Every charset detector depends heavily on sufficient content. In common cases, do not bother running detection on very tiny content.
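The second limitation is easy to see with the built-in codecs. In this sketch (the one-byte payload is made up for the example), a single byte decodes cleanly under several tables and gives a plausible letter each time, so nothing in the content can tell them apart.

```python
# A one-byte payload: far too little content to decide on an encoding.
tiny = b"\xe9"

# Each codec accepts the byte and yields a perfectly ordinary letter,
# but a different one each time (Latin, Cyrillic, Greek).
for codec in ("cp1252", "cp1251", "iso8859_7"):
    print(codec, "->", tiny.decode(codec))
```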

	## 👤 Contributing

	Contributions, issues and feature requests are very much welcome.<br />
	Feel free to check [issues page](https://github.com/ousret/charset_normalizer/issues) if you want to contribute.

	## 📝 License

	Copyright © 2019 [Ahmed TAHRI @Ousret](https://github.com/Ousret).<br />
	This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.

	Character frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)