pagexml.parser module

pagexml.parser.json_to_column_container(json_doc: dict) → tuple

pagexml.parser.json_to_pagexml_column(json_doc: dict) → PageXMLColumn

pagexml.parser.json_to_pagexml_doc(json_doc: dict) → PageXMLDoc

pagexml.parser.json_to_pagexml_line(json_doc: dict) → PageXMLTextLine

pagexml.parser.json_to_pagexml_page(json_doc: dict) → PageXMLPage

pagexml.parser.json_to_pagexml_scan(json_doc: dict) → PageXMLScan

pagexml.parser.json_to_pagexml_text_region(json_doc: dict) → PageXMLTextRegion

pagexml.parser.json_to_pagexml_word(json_doc: dict) → PageXMLWord

pagexml.parser.parse_baseline(baseline: dict) → Baseline

pagexml.parser.parse_conf(text_element: dict) → float | None

pagexml.parser.parse_coords(coords: dict) → Coords | None
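To illustrate the kind of input parse_coords is likely to receive: in PAGE XML, a Coords element stores its polygon in a points attribute as space-separated x,y pairs, and xmltodict exposes XML attributes with an @ prefix by default. The sketch below is a stand-in under those assumptions, not the library's implementation. points_from_coords_dict is a hypothetical helper, and it returns plain tuples rather than a Coords object.

```python
from typing import Dict, List, Optional, Tuple


def points_from_coords_dict(coords: Dict[str, str]) -> Optional[List[Tuple[int, int]]]:
    """Hypothetical helper: turn {'@points': '0,0 10,0'} into [(0, 0), (10, 0)]."""
    points_string = coords.get("@points")
    if not points_string:
        # Mirror the `Coords | None` return type: no usable points, no result.
        return None
    points = []
    for pair in points_string.split():
        x, y = pair.split(",")
        points.append((int(x), int(y)))
    return points


example = {"@points": "10,20 110,20 110,60 10,60"}
print(points_from_coords_dict(example))
```

Returning None for a missing or empty points attribute matches the optional return type in the signature above; how the real function treats malformed pairs is not documented here.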

pagexml.parser.parse_custom_metadata(text_element: Dict[str, any], custom_tags: Iterable = []) → Dict[str, any]

Parse custom metadata, such as readingOrder and structure.

pagexml.parser.parse_custom_metadata_element(custom_string: str, custom_field: str) → Dict[str, str]

pagexml.parser.parse_custom_metadata_element_list(custom_string: str, custom_field: str) → List[Dict[str, str]]
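The custom metadata functions take a custom_string and a custom_field. A minimal sketch of that idea follows, assuming the Transkribus-style custom attribute format `field {key:value;}`. split_custom_fields and FIELD_PATTERN are hypothetical names, and the library's own parsing may differ in its details.

```python
import re
from typing import Dict, List

# Matches one "field {key:value; key:value;}" group in a custom attribute string.
FIELD_PATTERN = re.compile(r"(\w+)\s*\{([^}]*)\}")


def split_custom_fields(custom_string: str, custom_field: str) -> List[Dict[str, str]]:
    """Hypothetical helper: collect every occurrence of one field as a dict."""
    results = []
    for name, body in FIELD_PATTERN.findall(custom_string):
        if name != custom_field:
            continue
        pairs = {}
        for part in body.split(";"):
            if ":" in part:
                key, value = part.split(":", 1)
                pairs[key.strip()] = value.strip()
        results.append(pairs)
    return results


custom = "readingOrder {index:0;} structure {type:paragraph;}"
print(split_custom_fields(custom, "structure"))
```

Returning a list covers fields that occur more than once in the same string, which is presumably why the module offers both a single-element and a list variant.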

pagexml.parser.parse_line_words(textline: dict) → List[PageXMLWord]

pagexml.parser.parse_page_image_size(page_json: dict) → Coords

pagexml.parser.parse_page_metadata(metadata_json: dict) → dict

pagexml.parser.parse_page_reading_order(page_json: dict) → dict

pagexml.parser.parse_pagexml_file(pagexml_file: str, pagexml_data: str | None = None, custom_tags: Iterable = {}, encoding: str = 'utf-8') → PageXMLScan

Read PageXML from file (or content of file passed separately if read from elsewhere, e.g. tarball) and return a PageXMLScan object.

Parameters:

pagexml_file (str) – filepath to a PageXML file
pagexml_data (str | None) – optional string representation of the PageXML document (corresponding to the content of pagexml_file)
custom_tags (Iterable) – custom tags to be parsed in the metadata
encoding (str) – the encoding of the file (default utf-8)

Returns:

a pdm.PageXMLScan object

Return type:

PageXMLScan
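The description above mentions passing file content separately when it comes from elsewhere, such as a tarball. The sketch below shows one way that could look using the standard tarfile module. iter_xml_members and parse_scans_from_tar are hypothetical helpers, and selecting members by an .xml suffix is an assumption.

```python
import tarfile
from typing import Iterator, Tuple


def iter_xml_members(tar_path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Hypothetical helper: yield (member name, decoded content) for each .xml member."""
    with tarfile.open(tar_path) as archive:
        for member in archive.getmembers():
            if not member.isfile() or not member.name.endswith(".xml"):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            yield member.name, handle.read().decode(encoding)


def parse_scans_from_tar(tar_path: str):
    """Pass each member's name and content to parse_pagexml_file."""
    # Imported lazily so the helper above is usable without the pagexml package.
    from pagexml.parser import parse_pagexml_file

    for name, content in iter_xml_members(tar_path):
        yield parse_pagexml_file(pagexml_file=name, pagexml_data=content)
```

For whole archives, parse_pagexml_files_from_archive in this module covers the same case directly; a hand-rolled loop like this is mainly useful when the content arrives from some other source.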

pagexml.parser.parse_pagexml_files(pagexml_files: List[str], ignore_errors: bool = False, encoding: str = 'utf-8') → Generator[PageXMLScan, None, None]

Parse a list of PageXML files and yield each as a PageXMLScan object.
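The generator return type and the ignore_errors flag suggest a lazy loop that can skip files that fail to parse. The following is a generic sketch of that pattern, not the library's code. parse_each and fake_parse are hypothetical, and whether the real function skips silently or logs a warning is an assumption left open here.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def parse_each(paths: Iterable[str], parse_one: Callable[[str], T],
               ignore_errors: bool = False) -> Iterator[T]:
    """Hypothetical helper: lazily parse each path, optionally skipping failures."""
    for path in paths:
        try:
            result = parse_one(path)
        except Exception:
            if ignore_errors:
                # Skip the failing file and continue with the next one.
                continue
            raise
        yield result


def fake_parse(path: str) -> str:
    # Stand-in for a real parser; fails on names containing "broken".
    if "broken" in path:
        raise ValueError(f"cannot parse {path}")
    return path.upper()


print(list(parse_each(["a.xml", "broken.xml", "b.xml"], fake_parse, ignore_errors=True)))
```

Because the function is a generator, nothing is parsed until the caller iterates, so a failure with ignore_errors=False surfaces only when the failing file is reached.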

pagexml.parser.parse_pagexml_files_from_archive(archive_file: str, ignore_errors: bool = False, silent_mode: bool = False, encoding: str = 'utf-8') → Generator[PageXMLScan, None, None]

Parse a list of PageXML files from an archive (e.g. zip, tar) and return each PageXML file as a PageXMLScan object.

Parameters:

archive_file (str) – filepath of an archive (zip, tar) containing PageXML files
ignore_errors (bool) – whether to ignore errors when parsing individual PageXML files
silent_mode (bool) – whether to suppress error warnings when parsing individual PageXML files
encoding (str) – the encoding of the file (default utf-8)

Returns:

a generator that yields PageXMLScan objects

Return type:

Generator[PageXMLScan, None, None]

pagexml.parser.parse_pagexml_files_from_directory(pagexml_directories: List[str], show_progress: bool = False) → Generator[PageXMLScan, None, None]

Parse PageXML files from one or more directories.

Parameters:

pagexml_directories (List[str]) – the name of one or more directories containing uncompressed PageXML files
show_progress (bool) – flag to determine whether a TQDM progress bar is shown

Returns:

a generator that yields PageXMLScan objects

Return type:

Generator[PageXMLScan, None, None]

pagexml.parser.parse_pagexml_from_json(pagexml_json: str | Dict[str, any]) → PageXMLDoc

Turn a JSON representation of a PageXML document into an instance from the physical document model.

pagexml.parser.parse_pagexml_json(pagexml_file: str, scan_json: dict, custom_tags: Iterable = []) → PageXMLScan

Parse a JSON/xmltodict representation of a PageXML file and return a PageXMLScan object.

pagexml.parser.parse_text_equiv(text_equiv: dict) → str | None

pagexml.parser.parse_textline(textline: dict, custom_tags: Iterable = []) → PageXMLTextLine

pagexml.parser.parse_textline_list(textline_list: list, custom_tags: Iterable = []) → List[PageXMLTextLine]

pagexml.parser.parse_textregion(text_region_dict: dict, custom_tags: Iterable = []) → PageXMLTextRegion | None

pagexml.parser.parse_textregion_list(textregion_dict_list: list, custom_tags: Iterable = []) → List[PageXMLTextRegion]

pagexml.parser.read_pagexml_dirs(pagexml_dirs: str | List[str]) → List[str]

Return a list of all (Page)XML files within a list of directories.

Parameters:

pagexml_dirs (Union[str, List[str]]) – a directory or a list of directories containing PageXML files
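A rough sketch of what collecting XML files from one or more directories can look like with the standard library is shown below. list_xml_files is a hypothetical helper, and both the recursive search and the .xml suffix filter are assumptions; read_pagexml_dirs itself may select files differently.

```python
import os
from typing import List, Union


def list_xml_files(pagexml_dirs: Union[str, List[str]]) -> List[str]:
    """Hypothetical helper: collect .xml file paths from one or more directories."""
    if isinstance(pagexml_dirs, str):
        # Accept a single directory name as well as a list, as the signature allows.
        pagexml_dirs = [pagexml_dirs]
    xml_files = []
    for directory in pagexml_dirs:
        # Assumption: subdirectories are searched too; the library may behave differently.
        for root, _dirs, files in os.walk(directory):
            for filename in sorted(files):
                if filename.endswith(".xml"):
                    xml_files.append(os.path.join(root, filename))
    return xml_files
```

A list of paths produced this way is the kind of input that parse_pagexml_files accepts as pagexml_files.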

pagexml.parser.read_pagexml_file(pagexml_file: str, encoding: str = 'utf-8') → str

Return the content of a PageXML file as a text string.