pagexml.helper.file_helper module
- class pagexml.helper.file_helper.Extractor(page_archive_file: str, filenames_only: bool = False)
Bases: object
- pagexml.helper.file_helper.get_archived_file_names(archive_handle: TarFile | ZipFile) → List[str]
- pagexml.helper.file_helper.get_archived_files_infos(archive_handle: TarFile | ZipFile) → List[ZipInfo | TarInfo]
- pagexml.helper.file_helper.get_archiver_mode(page_archive_file: str) → Tuple[Literal['tar', 'zip', 'py7zr'], Literal['r', 'r:', 'r:gz', 'r:bz2']]
- pagexml.helper.file_helper.parse_archived_filename(archived_fname: str) → Tuple[str, str, str]
Split the full pathname of an archive file into directory, file base name and extension.
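The split described above can be sketched with the standard library. This is a minimal illustration, not the library's implementation: the function name `split_archived_filename` is hypothetical, and whether the real function keeps the leading dot on the extension is an assumption.

```python
import posixpath
from typing import Tuple


def split_archived_filename(archived_fname: str) -> Tuple[str, str, str]:
    """Hypothetical stand-in: split an archive member path into
    (directory, file base name, extension)."""
    # Zip and tar member names use forward slashes, so posixpath is used
    # regardless of the host operating system.
    directory, filename = posixpath.split(archived_fname)
    # splitext keeps the leading dot on the extension, e.g. ".xml".
    # Assumption: the real function may return the extension differently.
    base_name, extension = posixpath.splitext(filename)
    return directory, base_name, extension


print(split_archived_filename("scans/volume_1/page_0001.xml"))
```

A member stored at the top level of the archive yields an empty string as its directory.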
- pagexml.helper.file_helper.read_7z_handle(page_7z_file: str, page_7z_handle: SevenZipFile, filenames_only: bool = False) → Generator[Tuple[dict, str | None], None, None]
- pagexml.helper.file_helper.read_inner_archive(archived_filename: str, archived_file_ext: str, archived_file_handle: IO[bytes] | bytes, file_info: Dict[str, any], filenames_only: bool = False)
- pagexml.helper.file_helper.read_page_7z_file(page_7z_file: str, filenames_only: bool = False) → Generator[Tuple[dict, str | None], None, None]
- pagexml.helper.file_helper.read_page_archive_file(page_archive_file: str, filenames_only: bool = False, show_progress: bool = False) → Generator[Tuple[dict, str | None], None, None]
Read PageXML files from an archive file (e.g. zip, tar or 7z).
- Parameters:
page_archive_file (str) – the name of the archive file
filenames_only (bool) – whether to return only the archived filenames or also the content (default is False)
show_progress (bool) – whether a TQDM progress bar is shown (default is False)
- Returns:
a generator that yields tuples of archived file info (a dict) and file content (a str, or None)
- Return type:
Generator[Tuple[dict, str | None], None, None]
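The generator contract in the signature above (an info dict paired with content that may be None) can be illustrated with a stdlib-only sketch for zip archives. This is not the library's code: the name `iter_zip_members` and the dict keys are hypothetical, and yielding None as content when `filenames_only` is True is an assumption inferred from the `str | None` annotation.

```python
import io
import zipfile
from typing import Generator, Optional, Tuple


def iter_zip_members(archive, filenames_only: bool = False
                     ) -> Generator[Tuple[dict, Optional[str]], None, None]:
    """Hypothetical sketch: yield (file info, content) per file in a zip archive."""
    with zipfile.ZipFile(archive, "r") as zip_handle:
        for member in zip_handle.infolist():
            if member.is_dir():
                continue  # skip directory entries
            # Illustrative keys only; the library's info dict may differ.
            file_info = {"archived_filename": member.filename,
                         "size": member.file_size}
            if filenames_only:
                # Assumption: no content is read when only names are requested.
                yield file_info, None
            else:
                yield file_info, zip_handle.read(member).decode("utf-8")


# Build a small in-memory archive to demonstrate the sketch.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_out:
    zip_out.writestr("pages/page_0001.xml", "<PcGts/>")

for info, content in iter_zip_members(buffer):
    print(info["archived_filename"], content)
```

Because the function is a generator, each member is read only when the caller asks for the next item, which keeps memory use low for large archives.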
- pagexml.helper.file_helper.read_page_archive_files(page_archive_files: List[str], filenames_only: bool = False, show_progress: bool = False) → Generator[Tuple[dict, str | None], None, None]
Read PageXML files from a list of archive files (e.g. zip, tar or 7z).
- Parameters:
page_archive_files (List[str]) – the names of the archive files
filenames_only (bool) – whether to return only the archived filenames or also the content (default is False)
show_progress (bool) – whether a TQDM progress bar is shown (default is False)
- Returns:
a generator that yields tuples of archived file info (a dict) and file content (a str, or None)
- Return type:
Generator[Tuple[dict, str | None], None, None]
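Reading from a list of archives amounts to chaining one generator per archive. The sketch below shows that pattern for tar files only, using the standard library. All function names and dict keys are hypothetical, and it does not reproduce the library's handling of zip or 7z archives or its progress bar.

```python
import io
import os
import tarfile
import tempfile
from typing import Generator, Iterable, List, Optional, Tuple


def iter_tar_members(archive_path: str, filenames_only: bool = False
                     ) -> Generator[Tuple[dict, Optional[str]], None, None]:
    """Hypothetical sketch: yield (file info, content) per file in one tar archive."""
    with tarfile.open(archive_path, "r") as tar_handle:
        for member in tar_handle:
            if not member.isfile():
                continue  # skip directories, links, etc.
            # Illustrative keys only; the library's info dict may differ.
            file_info = {"archive_file": archive_path,
                         "archived_filename": member.name}
            if filenames_only:
                yield file_info, None
            else:
                content = tar_handle.extractfile(member).read().decode("utf-8")
                yield file_info, content


def iter_tar_archives(archive_paths: Iterable[str], filenames_only: bool = False
                      ) -> Generator[Tuple[dict, Optional[str]], None, None]:
    """Chain the per-archive generator over several archives, in order."""
    for archive_path in archive_paths:
        yield from iter_tar_members(archive_path, filenames_only=filenames_only)


def build_demo_archives(directory: str) -> List[str]:
    """Write two tiny tar archives for demonstration and return their paths."""
    paths = []
    for i in (1, 2):
        path = os.path.join(directory, f"batch_{i}.tar")
        data = f"<PcGts id='{i}'/>".encode("utf-8")
        with tarfile.open(path, "w") as tar_out:
            member = tarfile.TarInfo(name=f"page_{i}.xml")
            member.size = len(data)
            tar_out.addfile(member, io.BytesIO(data))
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp_dir:
    for info, content in iter_tar_archives(build_demo_archives(tmp_dir)):
        print(info["archived_filename"], content)
```

Recording the archive path in each info dict lets the caller tell which archive a page came from once the streams are merged.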