pagexml.helper.file_helper module

class pagexml.helper.file_helper.Extractor(page_archive_file: str, filenames_only: bool = False)

Bases: object

pagexml.helper.file_helper.get_archive_functions(archiver: str)
pagexml.helper.file_helper.get_archived_file_names(archive_handle: TarFile | ZipFile) → List[str]
pagexml.helper.file_helper.get_archived_files_infos(archive_handle: TarFile | ZipFile) → List[ZipInfo | TarInfo]
pagexml.helper.file_helper.get_archiver_mode(page_archive_file: str) → Tuple[Literal['tar', 'zip', 'py7zr'], Literal['r', 'r:', 'r:gz', 'r:bz2']]
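
The return type of get_archiver_mode pairs an archiver name with an open mode. A minimal sketch of how such a pair could be derived from a file name is given here. The function name guess_archiver_mode and the suffix table are hypothetical and are not taken from the library, which may recognise other suffixes or use a different lookup.

```python
from typing import Tuple

# Hypothetical suffix table (an assumption, not the library's own list).
# Longer suffixes come first so ".tar.gz" is matched before ".tar" could be.
_SUFFIX_MODES = [
    (".tar.gz", ("tar", "r:gz")),
    (".tgz", ("tar", "r:gz")),
    (".tar.bz2", ("tar", "r:bz2")),
    (".tar", ("tar", "r:")),
    (".zip", ("zip", "r")),
    (".7z", ("py7zr", "r")),
]


def guess_archiver_mode(archive_file: str) -> Tuple[str, str]:
    """Return an (archiver, mode) pair based on the file-name suffix."""
    lowered = archive_file.lower()
    for suffix, archiver_mode in _SUFFIX_MODES:
        if lowered.endswith(suffix):
            return archiver_mode
    raise ValueError(f"unsupported archive type: {archive_file}")
```

The tar modes 'r:', 'r:gz' and 'r:bz2' are the standard mode strings accepted by tarfile.open, which is presumably why they appear in the signature.
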
pagexml.helper.file_helper.parse_archived_filename(archived_fname: str) → Tuple[str, str, str]

Split the full pathname of an archive file into directory, file base name and extension.
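
As a rough illustration of this kind of split, the sketch below uses the standard posixpath module, since archive member names normally use forward slashes. The function name split_archived_name is hypothetical. Whether the library keeps the leading dot on the extension, or treats double extensions such as .tar.gz as one unit, is not stated here, so both are assumptions.

```python
import posixpath
from typing import Tuple


def split_archived_name(archived_fname: str) -> Tuple[str, str, str]:
    """Split an archived path into (directory, base name, extension)."""
    # Separate the directory part from the final path component.
    directory, file_name = posixpath.split(archived_fname)
    # Split off the last extension only; the dot stays on the extension.
    base_name, extension = posixpath.splitext(file_name)
    return directory, base_name, extension


directory, base_name, extension = split_archived_name("scans/inv_1/page_0001.xml")
print(directory, base_name, extension)
```
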

pagexml.helper.file_helper.read_7z_handle(page_7z_file: str, page_7z_handle: SevenZipFile, filenames_only: bool = False) → Generator[Tuple[dict, str | None], None, None]
pagexml.helper.file_helper.read_inner_archive(archived_filename: str, archived_file_ext: str, archived_file_handle: IO[bytes] | bytes, file_info: Dict[str, any], filenames_only: bool = False)
pagexml.helper.file_helper.read_page_7z_file(page_7z_file: str, filenames_only: bool = False) → Generator[Tuple[dict, str | None], None, None]
pagexml.helper.file_helper.read_page_archive_file(page_archive_file: str, filenames_only: bool = False, show_progress: bool = False) → Generator[Tuple[dict, str | None], None, None]

Read PageXML files from an archive file (e.g. zip, tar or 7z).

Parameters:
  • page_archive_file (str) – the name of the archive file

  • filenames_only (bool) – whether to return only the archived filenames or also the content (default is False)

  • show_progress (bool) – whether a TQDM progress bar is shown (default is False)

Returns:

a generator that yields tuples of archived file info (a dict) and file content (a str, or None)

Return type:

Generator[Tuple[dict, str | None], None, None]
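
The generator pattern described above can be sketched with the standard zipfile module alone. This is not the library's implementation: the function name iter_zip_members and the keys of the info dict are hypothetical, and yielding None as content when filenames_only is True is an assumption based on the str | None in the signature.

```python
import io
import zipfile
from typing import Generator, Optional, Tuple


def iter_zip_members(archive, filenames_only: bool = False
                     ) -> Generator[Tuple[dict, Optional[str]], None, None]:
    """Yield (file info, content) for each regular file in a zip archive."""
    with zipfile.ZipFile(archive, "r") as zip_handle:
        for member in zip_handle.infolist():
            if member.is_dir():
                continue  # skip directory entries
            # Hypothetical info dict; the library's own keys may differ.
            file_info = {
                "archived_filename": member.filename,
                "file_size": member.file_size,
            }
            if filenames_only:
                yield file_info, None
            else:
                yield file_info, zip_handle.read(member).decode("utf-8")


# Build a small in-memory archive so the sketch runs without files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_handle:
    zip_handle.writestr("scans/page_0001.xml", "<PcGts/>")

for file_info, content in iter_zip_members(buffer):
    print(file_info["archived_filename"], len(content))
```

Because the function is a generator, members are read one at a time as the caller iterates, so a large archive never has to be held in memory in full.
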

pagexml.helper.file_helper.read_page_archive_files(page_archive_files: List[str], filenames_only: bool = False, show_progress: bool = False) → Generator[Tuple[dict, str | None], None, None]

Read PageXML files from a list of archive files (e.g. zip, tar or 7z).

Parameters:
  • page_archive_files (List[str]) – the names of the archive files

  • filenames_only (bool) – whether to return only the archived filenames or also the content (default is False)

  • show_progress (bool) – whether a TQDM progress bar is shown (default is False)

Returns:

a generator that yields tuples of archived file info (a dict) and file content (a str, or None)

Return type:

Generator[Tuple[dict, str | None], None, None]

pagexml.helper.file_helper.read_tar_handle(archive_fname: str, archive_handle: TarFile, filenames_only: bool = False)
pagexml.helper.file_helper.read_zip_handle(archive_fname: str, archive_handle: ZipFile, filenames_only: bool = False) → Generator[Tuple[dict, str | None], None, None]