pagexml.parser module

pagexml.parser.json_to_column_container(json_doc: dict) → tuple
pagexml.parser.json_to_pagexml_column(json_doc: dict) → PageXMLColumn
pagexml.parser.json_to_pagexml_doc(json_doc: dict) → PageXMLDoc
pagexml.parser.json_to_pagexml_line(json_doc: dict) → PageXMLTextLine
pagexml.parser.json_to_pagexml_page(json_doc: dict) → PageXMLPage
pagexml.parser.json_to_pagexml_scan(json_doc: dict) → PageXMLScan
pagexml.parser.json_to_pagexml_text_region(json_doc: dict) → PageXMLTextRegion
pagexml.parser.json_to_pagexml_word(json_doc: dict) → PageXMLWord
pagexml.parser.parse_baseline(baseline: dict) → Baseline
pagexml.parser.parse_conf(text_element: dict) → float | None
pagexml.parser.parse_coords(coords: dict) → Coords | None
pagexml.parser.parse_custom_metadata(text_element: Dict[str, any], custom_tags: Iterable = []) → Dict[str, any]

Parse custom metadata such as readingOrder and structure.
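
As an illustration of what such custom metadata looks like, the sketch below parses a custom attribute string of the form commonly written by Transkribus-style PageXML exports, e.g. "readingOrder {index:0;} structure {type:paragraph;}". It is a minimal stand-alone sketch, not the library's implementation: the function name sketch_parse_custom and the regular expression are assumptions made for this example, and repeated tags simply overwrite each other here.

```python
import re
from typing import Dict

# Matches "<tag> {<key>:<value>; ...}" groups inside a custom attribute string.
CUSTOM_FIELD = re.compile(r"(\w+)\s*\{([^}]*)\}")


def sketch_parse_custom(custom: str) -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: split a custom string into {tag: {key: value}}."""
    parsed: Dict[str, Dict[str, str]] = {}
    for tag, body in CUSTOM_FIELD.findall(custom):
        attrs: Dict[str, str] = {}
        for part in body.split(";"):
            if ":" in part:
                # Split on the first colon only, so values may contain colons.
                key, value = part.split(":", 1)
                attrs[key.strip()] = value.strip()
        parsed[tag] = attrs  # a repeated tag overwrites the earlier one
    return parsed


example = "readingOrder {index:0;} structure {type:paragraph;}"
print(sketch_parse_custom(example))
# {'readingOrder': {'index': '0'}, 'structure': {'type': 'paragraph'}}
```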

pagexml.parser.parse_custom_metadata_element(custom_string: str, custom_field: str) → Dict[str, str]
pagexml.parser.parse_custom_metadata_element_list(custom_string: str, custom_field: str) → List[Dict[str, str]]
pagexml.parser.parse_line_words(textline: dict) → List[PageXMLWord]
pagexml.parser.parse_page_image_size(page_json: dict) → Coords
pagexml.parser.parse_page_metadata(metadata_json: dict) → dict
pagexml.parser.parse_page_reading_order(page_json: dict) → dict
pagexml.parser.parse_pagexml_file(pagexml_file: str, pagexml_data: str | None = None, custom_tags: Iterable = {}, encoding: str = 'utf-8') → PageXMLScan

Read PageXML from a file (or from file content passed separately when it was read from elsewhere, e.g. a tarball) and return a PageXMLScan object.

Parameters:
  • pagexml_file (str) – filepath to a PageXML file

  • pagexml_data (str) – string representation of PageXML document (corresponding to the content of pagexml_file)

  • custom_tags (list) – list of custom tags to be parsed in the metadata

  • encoding (str) – the encoding of the file (default utf-8)

Returns:

a pdm.PageXMLScan object

Return type:

PageXMLScan
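
The tarball case mentioned above could look roughly like the following sketch. The helper names iter_tar_xml and parse_scans_from_tar are hypothetical and exist only for this example; the only library call used is parse_pagexml_file with the pagexml_file and pagexml_data parameters documented here, and it is imported lazily so the first helper works without the pagexml package installed.

```python
import tarfile
from typing import Iterator, Tuple


def iter_tar_xml(archive_path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Hypothetical helper: yield (member name, decoded text) per .xml member."""
    with tarfile.open(archive_path) as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".xml"):
                handle = tar.extractfile(member)
                if handle is not None:
                    yield member.name, handle.read().decode(encoding)


def parse_scans_from_tar(archive_path: str):
    """Hypothetical helper: parse each member; needs the pagexml package."""
    from pagexml.parser import parse_pagexml_file  # lazy import

    for name, content in iter_tar_xml(archive_path):
        # The member name stands in for the file path; the content is passed separately.
        yield parse_pagexml_file(pagexml_file=name, pagexml_data=content)
```

For plain zip or tar archives, parse_pagexml_files_from_archive documented below covers this use case directly; a hand-written loop like this is mainly useful when the content comes from some other source.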

pagexml.parser.parse_pagexml_files(pagexml_files: List[str], ignore_errors: bool = False, encoding: str = 'utf-8') → Generator[PageXMLScan, None, None]

Parse a list of PageXML files and return each as a PageXMLScan object.

pagexml.parser.parse_pagexml_files_from_archive(archive_file: str, ignore_errors: bool = False, silent_mode: bool = False, encoding: str = 'utf-8') → Generator[PageXMLScan, None, None]

Parse a list of PageXML files from an archive (e.g. zip, tar) and return each PageXML file as a PageXMLScan object.

Parameters:
  • archive_file (str) – filepath of an archive (zip, tar) containing PageXML files

  • ignore_errors (bool) – whether to ignore errors when parsing individual PageXML files

  • silent_mode (bool) – whether to suppress error warnings when parsing individual PageXML files

  • encoding (str) – the encoding of the file (default utf-8)

Returns:

a generator that yields one PageXMLScan object per PageXML file in the archive

Return type:

Generator[PageXMLScan, None, None]
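
One plausible reading of the ignore_errors and silent_mode flags is sketched below on a generic iterable. This is an assumption about the intended behaviour, not the library's code: the function parse_each and its parse_one callback are hypothetical names, and the built-in int stands in for a real parser.

```python
import sys
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def parse_each(
    items: Iterable[T],
    parse_one: Callable[[T], R],
    ignore_errors: bool = False,
    silent_mode: bool = False,
) -> Iterator[R]:
    """Hypothetical sketch: lazily parse items, optionally skipping failures."""
    for item in items:
        try:
            yield parse_one(item)
        except Exception as err:
            if not ignore_errors:
                raise  # default: the first failure stops the generator
            if not silent_mode:
                print(f"skipping {item!r}: {err}", file=sys.stderr)


# "x" cannot be parsed by int(), so it is skipped without a warning.
print(list(parse_each(["1", "x", "3"], int, ignore_errors=True, silent_mode=True)))
# [1, 3]
```

Because the result is a generator, nothing is parsed until it is iterated, and a failure with ignore_errors=False only surfaces when the offending item is reached.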

pagexml.parser.parse_pagexml_files_from_directory(pagexml_directories: List[str], show_progress: bool = False) → Generator[PageXMLScan, None, None]

Parse PageXML files from one or more directories.

Parameters:
  • pagexml_directories (List[str]) – the name of one or more directories containing uncompressed PageXML files

  • show_progress (bool) – flag to determine whether a TQDM progress bar is shown

Returns:

a generator that yields a PageXMLScan object for each parsed PageXML file

Return type:

Generator[PageXMLScan, None, None]

pagexml.parser.parse_pagexml_from_json(pagexml_json: str | Dict[str, any]) → PageXMLDoc

Turn a JSON representation of a PageXML document into an instance from the physical document model.
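
Since pagexml_json may be either a str or a dict, a caller-side normalisation step might look like the sketch below. It assumes that a str argument holds a serialized JSON document (not a file path); the helper name load_pagexml_json and the example key are hypothetical and not part of the library.

```python
import json
from typing import Any, Dict, Union


def load_pagexml_json(pagexml_json: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: accept a JSON string or an already-decoded dict."""
    if isinstance(pagexml_json, str):
        # Assumption: the string is JSON text, not a path to a JSON file.
        return json.loads(pagexml_json)
    return pagexml_json


# Both input forms end up as the same dict.
print(load_pagexml_json('{"id": "scan-1"}') == load_pagexml_json({"id": "scan-1"}))
# True
```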

pagexml.parser.parse_pagexml_json(pagexml_file: str, scan_json: dict, custom_tags: Iterable = []) → PageXMLScan

Parse a JSON/xmltodict representation of a PageXML file and return a PageXMLScan object.

pagexml.parser.parse_text_equiv(text_equiv: dict) → str | None
pagexml.parser.parse_textline(textline: dict, custom_tags: Iterable = []) → PageXMLTextLine
pagexml.parser.parse_textline_list(textline_list: list, custom_tags: Iterable = []) → List[PageXMLTextLine]
pagexml.parser.parse_textregion(text_region_dict: dict, custom_tags: Iterable = []) → PageXMLTextRegion | None
pagexml.parser.parse_textregion_list(textregion_dict_list: list, custom_tags: Iterable = []) → List[PageXMLTextRegion]
pagexml.parser.read_pagexml_dirs(pagexml_dirs: str | List[str]) → List[str]

Return a list of all (Page)XML files within a list of directories.

Parameters:

pagexml_dirs (Union[str, List[str]]) – a directory or a list of directories containing PageXML files.
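
A minimal sketch of this kind of directory listing, using only the standard library, is shown below. It is not the library's implementation: the name list_xml_files is hypothetical, and the sketch assumes a non-recursive search for files ending in .xml, which may differ from what read_pagexml_dirs actually does.

```python
import glob
import os
from typing import List, Union


def list_xml_files(pagexml_dirs: Union[str, List[str]]) -> List[str]:
    """Hypothetical helper: collect *.xml paths from one or more directories."""
    if isinstance(pagexml_dirs, str):
        # Accept a single directory name as well as a list of names.
        pagexml_dirs = [pagexml_dirs]
    files: List[str] = []
    for directory in pagexml_dirs:
        # Assumption: top-level files only, sorted per directory.
        files.extend(sorted(glob.glob(os.path.join(directory, "*.xml"))))
    return files
```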

pagexml.parser.read_pagexml_file(pagexml_file: str, encoding: str = 'utf-8') → str

Return the content of a PageXML file as a text string.