pagexml.parser module
- pagexml.parser.json_to_pagexml_column(json_doc: dict) PageXMLColumn[source]
- pagexml.parser.json_to_pagexml_doc(json_doc: dict) PageXMLDoc[source]
- pagexml.parser.json_to_pagexml_line(json_doc: dict) PageXMLTextLine[source]
- pagexml.parser.json_to_pagexml_page(json_doc: dict) PageXMLPage[source]
- pagexml.parser.json_to_pagexml_scan(json_doc: dict) PageXMLScan[source]
- pagexml.parser.json_to_pagexml_text_region(json_doc: dict) PageXMLTextRegion[source]
- pagexml.parser.json_to_pagexml_word(json_doc: dict) PageXMLWord[source]
- pagexml.parser.parse_custom_metadata(text_element: Dict[str, any], custom_tags: Iterable = []) Dict[str, any][source]
Parse custom metadata, such as readingOrder and structure.
- pagexml.parser.parse_custom_metadata_element(custom_string: str, custom_field: str) Dict[str, str][source]
- pagexml.parser.parse_custom_metadata_element_list(custom_string: str, custom_field: str) List[Dict[str, str]][source]
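PageXML files from tools such as Transkribus typically store metadata like reading order in a `custom` attribute, as a string such as `readingOrder {index:0;} structure {type:paragraph;}`. The following standalone sketch shows one way such a string can be split into fields. The helper name `split_custom_fields` and the regular expression are illustrative assumptions, not the implementation used by pagexml.parser.

```python
import re
from typing import Dict, List

# Matches one "name {key:value; key:value;}" group in a custom attribute string.
FIELD_PATTERN = re.compile(r"(\w+)\s*\{([^}]*)\}")


def split_custom_fields(custom_string: str) -> Dict[str, List[Dict[str, str]]]:
    """Group the key/value pairs of a custom attribute string by field name.

    Hypothetical helper for illustration; not part of pagexml.parser.
    """
    fields: Dict[str, List[Dict[str, str]]] = {}
    for name, body in FIELD_PATTERN.findall(custom_string):
        pairs: Dict[str, str] = {}
        for part in body.split(";"):
            if ":" in part:
                key, value = part.split(":", 1)
                pairs[key.strip()] = value.strip()
        # A field name may occur more than once, so keep a list per name.
        fields.setdefault(name, []).append(pairs)
    return fields


example = "readingOrder {index:0;} structure {type:paragraph;}"
print(split_custom_fields(example))
# {'readingOrder': [{'index': '0'}], 'structure': [{'type': 'paragraph'}]}
```

Keeping a list per field name loosely mirrors the split between parse_custom_metadata_element (one dictionary) and parse_custom_metadata_element_list (a list of dictionaries) in the signatures above.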
- pagexml.parser.parse_line_words(textline: dict) List[PageXMLWord][source]
- pagexml.parser.parse_pagexml_file(pagexml_file: str, pagexml_data: str | None = None, custom_tags: Iterable = {}, encoding: str = 'utf-8') PageXMLScan[source]
Read PageXML from a file (or from content passed separately when it was read from elsewhere, e.g. a tarball) and return a PageXMLScan object.
- Parameters:
pagexml_file (str) – filepath to a PageXML file
pagexml_data (str) – string representation of PageXML document (corresponding to the content of pagexml_file)
custom_tags (list) – list of custom tags to be parsed in the metadata
encoding (str) – the encoding of the file (default utf-8)
- Returns:
a pdm.PageXMLScan object
- Return type:
PageXMLScan
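When the PageXML content comes from somewhere other than the file system, the content can be passed through `pagexml_data` while `pagexml_file` carries the name. The sketch below reads `.xml` members from a tar archive with the standard library and hands each one to parse_pagexml_file. The helper names `iter_tar_xml` and `parse_tar_members` are hypothetical, and the example assumes the pagexml-tools package is installed for the parsing step.

```python
import tarfile
from typing import Iterator, Tuple


def iter_tar_xml(tar_path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Yield (member name, decoded content) for each .xml file in a tar archive."""
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".xml"):
                handle = tar.extractfile(member)
                if handle is not None:
                    yield member.name, handle.read().decode(encoding)


def parse_tar_members(tar_path: str):
    """Parse every .xml member of a tar archive into a PageXMLScan object."""
    # Imported lazily so iter_tar_xml works without pagexml-tools installed.
    from pagexml.parser import parse_pagexml_file

    for name, content in iter_tar_xml(tar_path):
        # The member name stands in for the file path; the content is passed separately.
        yield parse_pagexml_file(name, pagexml_data=content)
```

For the common case, parse_pagexml_files_from_archive does this in one call; a manual loop like this is mainly useful when you want to filter or rename members yourself.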
- pagexml.parser.parse_pagexml_files(pagexml_files: List[str], ignore_errors: bool = False, encoding: str = 'utf-8') Generator[PageXMLScan, None, None][source]
Parse a list of PageXML files and return each as a PageXMLScan object.
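Because parse_pagexml_files takes an explicit list of paths, you can decide yourself which files are parsed. The sketch below collects `.xml` files with `pathlib` and passes them on with `ignore_errors=True`. The helper names `find_pagexml_files` and `parse_directory_leniently` are hypothetical, and the comment on `ignore_errors` assumes it behaves as documented for the archive variant.

```python
from pathlib import Path
from typing import List


def find_pagexml_files(directory: str) -> List[str]:
    """Return the sorted paths of all .xml files below a directory."""
    return sorted(str(path) for path in Path(directory).rglob("*.xml"))


def parse_directory_leniently(directory: str):
    """Yield a PageXMLScan object for each .xml file found below a directory."""
    # Imported lazily so find_pagexml_files works without pagexml-tools installed.
    from pagexml.parser import parse_pagexml_files

    files = find_pagexml_files(directory)
    # Assumption: ignore_errors=True ignores errors in individual files.
    yield from parse_pagexml_files(files, ignore_errors=True)
```

Since the result is a generator, scans are parsed one at a time as you iterate, so a large collection does not need to fit in memory at once.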
- pagexml.parser.parse_pagexml_files_from_archive(archive_file: str, ignore_errors: bool = False, silent_mode: bool = False, encoding: str = 'utf-8') Generator[PageXMLScan, None, None][source]
Parse a list of PageXML files from an archive (e.g. zip, tar) and return each PageXML file as a PageXMLScan object.
- Parameters:
archive_file (str) – filepath of an archive (zip, tar) containing PageXML files
ignore_errors (bool) – whether to ignore errors when parsing individual PageXML files
silent_mode (bool) – whether to suppress warnings when parsing individual PageXML files
encoding (str) – the encoding of the file (default utf-8)
- Returns:
a generator that yields a PageXMLScan object for each PageXML file in the archive
- Return type:
Generator[PageXMLScan, None, None]
- pagexml.parser.parse_pagexml_files_from_directory(pagexml_directories: List[str], show_progress: bool = False) Generator[PageXMLScan, None, None][source]
Parse PageXML files from one or more directories.
- Parameters:
pagexml_directories (List[str]) – the name of one or more directories containing uncompressed PageXML files
show_progress (bool) – flag to determine whether a TQDM progress bar is shown
- Returns:
a generator that yields a PageXMLScan object for each parsed PageXML file
- Return type:
Generator[PageXMLScan, None, None]
- pagexml.parser.parse_pagexml_from_json(pagexml_json: str | Dict[str, any]) PageXMLDoc[source]
Turn a JSON representation of a PageXML document into an instance from the physical document model.
- pagexml.parser.parse_pagexml_json(pagexml_file: str, scan_json: dict, custom_tags: Iterable = []) PageXMLScan[source]
Parse a JSON/xmltodict representation of a PageXML file and return a PageXMLScan object.
- pagexml.parser.parse_textline(textline: dict, custom_tags: Iterable = []) PageXMLTextLine[source]
- pagexml.parser.parse_textline_list(textline_list: list, custom_tags: Iterable = []) List[PageXMLTextLine][source]
- pagexml.parser.parse_textregion(text_region_dict: dict, custom_tags: Iterable = []) PageXMLTextRegion | None[source]
- pagexml.parser.parse_textregion_list(textregion_dict_list: list, custom_tags: Iterable = []) List[PageXMLTextRegion][source]