Directory : /usr/lib/python3/dist-packages/chardet/__pycache__/
Current File : //usr/lib/python3/dist-packages/chardet/__pycache__/charsetprober.cpython-312.pyc
# Readable source recovered from the names, constants, and docstrings embedded
# in the compiled module. Source path recorded in the bytecode:
# /usr/lib/python3/dist-packages/chardet/charsetprober.py
# Comments are editorial; the compiled file carries none.

import logging
import re
from typing import Optional, Union

from .enums import LanguageFilter, ProbingState

# A "word" that contains at least one high byte, optionally followed by a
# single marker byte.
INTERNATIONAL_WORDS_PATTERN = re.compile(
    b"[a-zA-Z]*[\x80-\xFF]+[a-zA-Z]*[^a-zA-Z\x80-\xFF]?"
)


class CharSetProber:
    SHORTCUT_THRESHOLD = 0.95

    def __init__(self, lang_filter: LanguageFilter = LanguageFilter.NONE) -> None:
        self._state = ProbingState.DETECTING
        self.active = True
        self.lang_filter = lang_filter
        self.logger = logging.getLogger(__name__)

    def reset(self) -> None:
        self._state = ProbingState.DETECTING

    @property
    def charset_name(self) -> Optional[str]:
        return None

    @property
    def language(self) -> Optional[str]:
        raise NotImplementedError

    def feed(self, byte_str: Union[bytes, bytearray]) -> ProbingState:
        raise NotImplementedError

    @property
    def state(self) -> ProbingState:
        return self._state

    def get_confidence(self) -> float:
        return 0.0

    @staticmethod
    def filter_high_byte_only(buf: Union[bytes, bytearray]) -> bytes:
        # Collapse every run of ASCII bytes into a single space.
        buf = re.sub(b"([\x00-\x7F])+", b" ", buf)
        return buf

    @staticmethod
    def filter_international_words(buf: Union[bytes, bytearray]) -> bytearray:
        """
        We define three types of bytes:
        alphabet: english alphabets [a-zA-Z]
        international: international characters [\x80-\xFF]
        marker: everything else [^a-zA-Z\x80-\xFF]
        The input buffer can be thought to contain a series of words delimited
        by markers. This function works to filter all words that contain at
        least one international character. All contiguous sequences of markers
        are replaced by a single space ascii character.
        This filter applies to all scripts which do not use English characters.
        """
        filtered = bytearray()

        # Each match is a word with at least one international byte and,
        # optionally, one trailing marker byte.
        words = INTERNATIONAL_WORDS_PATTERN.findall(buf)

        for word in words:
            filtered.extend(word[:-1])

            # A trailing marker is replaced by a space; a trailing letter or
            # high byte is kept unchanged.
            last_char = word[-1:]
            if not last_char.isalpha() and last_char < b"\x80":
                last_char = b" "
            filtered.extend(last_char)

        return filtered

    @staticmethod
    def remove_xml_tags(buf: Union[bytes, bytearray]) -> bytes:
        """
        Returns a copy of ``buf`` that retains only the sequences of English
        alphabet and high byte characters that are not between <> characters.
        This filter can be applied to all scripts which contain both English
        characters and extended ASCII characters, but is currently only used by
        ``Latin1Prober``.
        """
        filtered = bytearray()
        in_tag = False
        prev = 0
        buf = memoryview(buf).cast("c")

        for curr, buf_char in enumerate(buf):
            if buf_char == b">":
                # Leaving a tag: kept text starts after this byte.
                prev = curr + 1
                in_tag = False
            elif buf_char == b"<":
                if curr > prev and not in_tag:
                    # Keep the text between the previous tag and this one,
                    # followed by a space as a delimiter.
                    filtered.extend(buf[prev:curr])
                    filtered.extend(b" ")
                in_tag = True

        # Keep any trailing text that follows the last closed tag.
        if not in_tag:
            filtered.extend(buf[prev:])

        return filtered