
Universal text decoding and encoding functions, with additional functions to read and write text files.
Source: smallparts/text/

Module contents


This module defines the following constants:


A dict mapping byte order marks (and the unnecessary, but often encountered UTF-8 signature) to the matching codecs.


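smallparts.text.transcode.DEFAULT_TARGET_ENCODING

'utf-8' as defined in smallparts.constants.UTF_8

smallparts.text.transcode.DEFAULT_FALLBACK_ENCODING

'cp1252' as defined in smallparts.constants.CP1252

smallparts.text.transcode.DEFAULT_LINE_ENDING

'\n' as defined in smallparts.constants.LF

smallparts.text.transcode.SUPPORTED_OUTPUT_LINE_ENDINGS

A tuple containing '\n' and '\r\n' (as defined in smallparts.constants.LF and smallparts.constants.CRLF)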


This module defines the following functions:

smallparts.text.transcode.to_unicode_and_encoding_name(bytestring, from_encoding=None, fallback_encoding=DEFAULT_FALLBACK_ENCODING)

Tries to decode bytestring to a unicode string. Returns a tuple containing the conversion result and the source encoding name when successful. Raises a UnicodeDecodeError if the bytes cannot be decoded with the encoding in use.
If bytestring is neither a bytes nor a bytearray instance, a TypeError is raised.
If from_encoding is provided, the function explicitly uses that encoding. Otherwise, it tries a simple form of encoding auto-detection by first trying all known byte order marks, then UTF-8, and fallback_encoding as the last resort.
All other functions in this module using these two keyword arguments wrap this function directly or indirectly, so these arguments have the same effect everywhere.
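
The detection order described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: sketch_decode and SKETCH_BOMS are hypothetical names, and the byte order mark table is an assumption built from the constants in the codecs module.

```python
import codecs

# Hypothetical BOM table for this sketch. The UTF-32 marks come before the
# UTF-16 marks because the UTF-32 LE mark starts with the UTF-16 LE mark.
SKETCH_BOMS = (
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF8, "utf_8"),
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
)


def sketch_decode(bytestring, from_encoding=None, fallback_encoding="cp1252"):
    """Return (text, encoding name), following the documented order."""
    if not isinstance(bytestring, (bytes, bytearray)):
        raise TypeError("bytes or bytearray expected")
    if from_encoding:
        # An explicit encoding bypasses auto-detection entirely.
        return bytestring.decode(from_encoding), from_encoding
    for bom, codec_name in SKETCH_BOMS:
        if bytestring.startswith(bom):
            # Strip the mark, then decode the remainder.
            return bytestring[len(bom):].decode(codec_name), codec_name
    try:
        return bytestring.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Last resort; may itself raise a UnicodeDecodeError.
        return bytestring.decode(fallback_encoding), fallback_encoding
```

In this sketch, a UTF-8 signature is reported as plain UTF-8 after the mark is stripped; the name the real function reports in that case is not specified here.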

smallparts.text.transcode.to_unicode(bytestring, from_encoding=None, fallback_encoding=DEFAULT_FALLBACK_ENCODING)

Wrapper for the to_unicode_and_encoding_name() function returning only the conversion result.

smallparts.text.transcode.anything_to_unicode(input_object, from_encoding=None, fallback_encoding=DEFAULT_FALLBACK_ENCODING)

Safe wrapper around the to_unicode() function catching the TypeError raised if input_object is neither a bytes nor a bytearray instance, and simply returning the string conversion of the input object.

smallparts.text.transcode.to_bytes(unicode_text, to_encoding=DEFAULT_TARGET_ENCODING)

Encodes unicode_text to a bytes representation using to_encoding. Raises a TypeError if unicode_text is not a str instance.

smallparts.text.transcode.anything_to_bytes(input_object, to_encoding=DEFAULT_TARGET_ENCODING, from_encoding=None, fallback_encoding=DEFAULT_FALLBACK_ENCODING)

Safe wrapper around the to_bytes() function catching the TypeError raised if input_object is not a str instance.
In that case, it applies the anything_to_unicode() function to input_object with the from_encoding and fallback_encoding arguments passed through. Then, the result of that conversion is converted using to_bytes().
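
The try-then-convert chain can be illustrated as follows. This is a simplified sketch under stated assumptions: the sketch_ functions are hypothetical, and bytes input is decoded with a fixed from_encoding instead of the auto-detection the real functions perform.

```python
def sketch_to_bytes(unicode_text, to_encoding="utf-8"):
    """Encode a str and refuse anything else, as documented for to_bytes()."""
    if not isinstance(unicode_text, str):
        raise TypeError("str expected")
    return unicode_text.encode(to_encoding)


def sketch_anything_to_bytes(input_object, to_encoding="utf-8",
                             from_encoding="utf-8"):
    """Encode directly if possible, otherwise convert to str first."""
    try:
        return sketch_to_bytes(input_object, to_encoding)
    except TypeError:
        if isinstance(input_object, (bytes, bytearray)):
            # Assumption of this sketch: a fixed source encoding.
            text = bytes(input_object).decode(from_encoding)
        else:
            # Any other object falls back to its string conversion.
            text = str(input_object)
        return sketch_to_bytes(text, to_encoding)
```

Passing bytes together with a from_encoding that differs from to_encoding effectively transcodes them, while arbitrary objects such as numbers are encoded via their string conversion.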


smallparts.text.transcode.to_utf8(unicode_text)

Shortcut for the explicit to_bytes(unicode_text, to_encoding='utf-8') call.

smallparts.text.transcode.anything_to_utf8(input_object, from_encoding=None, fallback_encoding=DEFAULT_FALLBACK_ENCODING)

Wrapper around the explicit anything_to_bytes(input_object, to_encoding='utf-8', from_encoding=from_encoding, fallback_encoding=fallback_encoding) call (the from_encoding and fallback_encoding arguments are passed through).

smallparts.text.transcode.fix_double_utf8_transformation(unicode_text, wrong_encoding=DEFAULT_FALLBACK_ENCODING)

Fixes a duplicate UTF-8 transformation, which is a frequent result of reading UTF-8 encoded text as if it were encoded in a single-byte Latin encoding (CP-1252, ISO-8859-1 or similar), resulting in character sequences like Ã¤Ã¶Ã¼ instead of äöü.
This function reverts the effect by re-encoding unicode_text using wrong_encoding and decoding it as UTF-8 again.
Specifying a wrong value for wrong_encoding can either produce unexpected results or raise a UnicodeEncodeError, while trying to fix an already correct text will raise a UnicodeDecodeError.
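
The round trip behind this repair is short enough to show with plain str and bytes methods. The function name sketch_fix_double_utf8 is hypothetical; the sketch only mirrors the re-encode and decode steps described above.

```python
def sketch_fix_double_utf8(unicode_text, wrong_encoding="cp1252"):
    """Undo one accidental 'UTF-8 bytes read as wrong_encoding' step."""
    # Re-encoding restores the original UTF-8 bytes,
    # which are then decoded correctly.
    return unicode_text.encode(wrong_encoding).decode("utf-8")


# Reproduce the damage: UTF-8 bytes misread as CP-1252.
garbled = "äöü".encode("utf-8").decode("cp1252")
repaired = sketch_fix_double_utf8(garbled)
```

Applying the sketch to the already correct 'äöü' re-encodes it to bytes that are not valid UTF-8, which is why a UnicodeDecodeError is raised in that case.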

smallparts.text.transcode.read_from_file(input_file_or_name, from_encoding=None, fallback_encoding=DEFAULT_FALLBACK_ENCODING)

Reads the contents of input_file_or_name and returns them decoded to unicode. As the argument name suggests, input_file_or_name may be either a file name or a file object.
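
One common way to accept either a file name or a file object is to check for a read() method. The sketch below assumes that approach and a fixed UTF-8 decoding; sketch_read_text is a hypothetical name, and the real function may distinguish the two cases differently and applies the encoding detection described earlier.

```python
import io


def sketch_read_text(input_file_or_name, encoding="utf-8"):
    """Read bytes from a file object or a file name and decode them."""
    if hasattr(input_file_or_name, "read"):
        # File object: assumed to be opened in binary mode.
        raw_content = input_file_or_name.read()
    else:
        # Anything else is treated as a file name.
        with open(input_file_or_name, "rb") as input_file:
            raw_content = input_file.read()
    return raw_content.decode(encoding)


text = sketch_read_text(io.BytesIO("Sacré-Cœur".encode("utf-8")))
```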

smallparts.text.transcode.prepare_file_output(unicode_content, to_encoding=DEFAULT_TARGET_ENCODING, line_ending=DEFAULT_LINE_ENDING)

Return unicode_content prepared for binary output to a file (i.e. as bytes, encoded as to_encoding and with line_ending as line ending).
Raises a ValueError if line_ending is not one of SUPPORTED_OUTPUT_LINE_ENDINGS, or a TypeError if unicode_content is neither a unicode string nor a sequence of unicode strings.
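
A possible shape for this preparation step is sketched below. It is an illustration only: the sketch_ names are hypothetical, and details such as splitting a single string with splitlines() and not appending a trailing line ending are assumptions of the sketch.

```python
SKETCH_LINE_ENDINGS = ("\n", "\r\n")


def sketch_prepare_file_output(unicode_content, to_encoding="utf-8",
                               line_ending="\n"):
    """Return the content as bytes with uniform line endings."""
    if line_ending not in SKETCH_LINE_ENDINGS:
        raise ValueError("unsupported line ending")
    if isinstance(unicode_content, str):
        # A single string is split at any existing line boundary.
        lines = unicode_content.splitlines()
    else:
        # list() raises a TypeError for non-iterable input.
        lines = list(unicode_content)
        if not all(isinstance(line, str) for line in lines):
            raise TypeError("str or sequence of str expected")
    return line_ending.join(lines).encode(to_encoding)
```

Joining the lines with the requested line ending normalizes mixed '\n' and '\r\n' input in one pass before encoding.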

smallparts.text.transcode.transcode_file(file_name, to_encoding=DEFAULT_TARGET_ENCODING, from_encoding=None, fallback_encoding=DEFAULT_FALLBACK_ENCODING, line_ending=None)

Transcodes the file with the name file_name to to_encoding.
Raises a ValueError if the file contents are already encoded in to_encoding.
Changes the line endings in the file contents to line_ending if line_ending is one of SUPPORTED_OUTPUT_LINE_ENDINGS.
Renames the original file to a file with the detected encoding appended to the original file name, but before the extension.
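
The renaming scheme for the original file can be pictured with os.path.splitext. The helper name sketch_backup_name is hypothetical, and the dot used as separator before the encoding name is an assumption; the text above only states that the encoding is placed before the extension.

```python
import os.path


def sketch_backup_name(file_name, detected_encoding):
    """Insert the detected encoding between file name and extension."""
    root, extension = os.path.splitext(file_name)
    # Assumed separator: a single dot before the encoding name.
    return "{0}.{1}{2}".format(root, detected_encoding, extension)


backup_name = sketch_backup_name("notes.txt", "cp1252")
```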

Usage examples

>>> from smallparts.text import transcode
>>> transcode.fix_double_utf8_transformation('Ã¤Ã¶Ã¼')
'äöü'
>>> source_text = '« Sacré-Cœur »'
>>> utf8_text = transcode.to_bytes(source_text)
>>> utf8_text
b'\xc2\xab Sacr\xc3\xa9-C\xc5\x93ur \xc2\xbb'
>>> transcode.to_unicode(utf8_text)
'« Sacré-Cœur »'
>>> source_text_2 = 'äöü'
>>> latin_text = transcode.to_bytes(source_text_2, to_encoding='iso-8859-1')
>>> latin_text
b'\xe4\xf6\xfc'
>>> utf_text_2 = transcode.to_bytes(source_text_2)
>>> utf_text_2
b'\xc3\xa4\xc3\xb6\xc3\xbc'
>>> # Encoding auto-detection (falls back to cp1252 for non-UTF8 encoded
>>> # byte sequences without a byte order mark)
>>> transcode.to_unicode_and_encoding_name(latin_text)
('äöü', 'cp1252')
>>> transcode.to_unicode_and_encoding_name(utf_text_2)
('äöü', 'utf-8')
