pagexml.helper.pagexml_helper module

class pagexml.helper.pagexml_helper.LineIterable(line_format_files: str | List[str], headers: List[str] = None)

Bases: object

pagexml.helper.pagexml_helper.combine_adjacent_lines(lines: List[PageXMLTextLine], reading_direction: str, avg_char_width: float)

pagexml.helper.pagexml_helper.elements_overlap(element1: PageXMLDoc, element2: PageXMLDoc, threshold: float = 0.5) -> bool

Check if two elements have overlapping coordinates.
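The exact overlap measure used by elements_overlap is not documented here. As a minimal sketch of one plausible reading, the example below compares two bounding boxes and reports an overlap when the intersection covers at least threshold of the smaller box. The name boxes_overlap and the (left, top, width, height) tuples are assumptions made for illustration, not part of the pagexml API.

```python
from typing import Tuple

# Hypothetical stand-in for element coordinates: (left, top, width, height).
Box = Tuple[int, int, int, int]


def boxes_overlap(box1: Box, box2: Box, threshold: float = 0.5) -> bool:
    """Return True if the intersection covers at least `threshold`
    of the smaller box's area (an assumed definition of overlap)."""
    left = max(box1[0], box2[0])
    top = max(box1[1], box2[1])
    right = min(box1[0] + box1[2], box2[0] + box2[2])
    bottom = min(box1[1] + box1[3], box2[1] + box2[3])
    if right <= left or bottom <= top:
        # The boxes do not intersect at all.
        return False
    overlap_area = (right - left) * (bottom - top)
    smaller_area = min(box1[2] * box1[3], box2[2] * box2[3])
    return overlap_area / smaller_area >= threshold


half_covered = boxes_overlap((0, 0, 100, 100), (50, 0, 100, 100))
barely_touching = boxes_overlap((0, 0, 100, 100), (90, 0, 100, 100))
disjoint = boxes_overlap((0, 0, 100, 100), (200, 200, 50, 50))
```

Measuring against the smaller box means a small element fully contained in a large one still counts as overlapping; a different denominator (for example the union of both boxes) would make the check stricter.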

pagexml.helper.pagexml_helper.get_custom_tags(doc: PageXMLDoc) -> List[Dict[str, any]]

Get all custom tags and their textual values from a PageXMLDoc.

This function assumes that the PageXML document was parsed with custom_tags passed to the parse_pagexml_file function. It retrieves those tags from all TextLines and uses each tag's offset and length to find the corresponding text. It returns a list of dictionaries, each with the tag type, the textual value, the region and line id, and the offset and length.

Parameters:

doc (pdm.PageXMLDoc) – A PageXMLDoc

Returns:

List of custom tags

Return type:

List[Dict[str, any]]
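The offset-and-length lookup described above can be illustrated without the library. The following is a minimal sketch on plain strings and dictionaries; the function name extract_tag_values and the dictionary keys are assumptions chosen for the example and may differ from what get_custom_tags actually returns.

```python
from typing import Any, Dict, List


def extract_tag_values(line_text: str,
                       tags: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach the tagged substring to each tag, using its offset and length."""
    results = []
    for tag in tags:
        offset, length = tag["offset"], tag["length"]
        results.append({
            "type": tag["type"],
            # The tagged text is the slice [offset, offset + length).
            "value": line_text[offset:offset + length],
            "offset": offset,
            "length": length,
        })
    return results


# Hypothetical line text and tags, as they might come from one TextLine.
line_text = "Jan Jansen van Amsterdam"
tags = [
    {"type": "person", "offset": 0, "length": 10},
    {"type": "place", "offset": 15, "length": 9},
]
extracted = extract_tag_values(line_text, tags)
```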

pagexml.helper.pagexml_helper.horizontal_group_lines(lines: List[PageXMLTextLine]) -> List[List[PageXMLTextLine]]

Sort the lines of a text region from top to bottom and return them as a list of lists, in which each inner list groups horizontally adjacent lines.
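To make the list-of-lists shape concrete, here is a minimal sketch that groups simplified line objects. The Line dataclass, the function name, and the adjacency rule (a line joins a group when its vertical midpoint falls within the vertical span of the group's first line) are all assumptions for illustration; the library may use a different criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical stand-in for a PageXMLTextLine."""
    text: str
    left: int
    top: int
    height: int


def group_horizontally_adjacent(lines: List[Line]) -> List[List[Line]]:
    groups: List[List[Line]] = []
    # Walk the lines from top to bottom.
    for line in sorted(lines, key=lambda l: l.top):
        if groups:
            first = groups[-1][0]
            midpoint = line.top + line.height / 2
            # Assumed rule: same row if the midpoint lies within the first line's span.
            if first.top <= midpoint <= first.top + first.height:
                groups[-1].append(line)
                continue
        groups.append([line])
    # Within a row, order the lines from left to right.
    return [sorted(group, key=lambda l: l.left) for group in groups]


lines = [
    Line("right half", left=500, top=102, height=40),
    Line("second row", left=100, top=160, height=40),
    Line("left half", left=100, top=100, height=40),
]
grouped = [[line.text for line in group]
           for group in group_horizontally_adjacent(lines)]
```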

pagexml.helper.pagexml_helper.horizontally_merge_lines(lines: List[PageXMLTextLine]) -> List[PageXMLTextLine]

Sort lines vertically and merge horizontally adjacent lines.

pagexml.helper.pagexml_helper.line_ends_with_word_break(curr_line: PageXMLTextLine, next_line: PageXMLTextLine, word_freq: Counter = None) -> bool

pagexml.helper.pagexml_helper.make_line_range(text: str, line: PageXMLTextLine, line_text: str) -> Dict[str, any]

pagexml.helper.pagexml_helper.make_line_text(line: PageXMLTextLine, do_merge: bool, end_word: str, merge_word: str, word_break_chars: str | Set[str] | List[str] = '-') -> str

pagexml.helper.pagexml_helper.make_text_region_text(lines: List[PageXMLTextLine], word_break_chars: str | Set[str] | List[str] = '-', wbd: WordBreakDetector = None) -> Tuple[str | None, List[Dict[str, any]]]

Turn the text lines in a region into a single paragraph of text, with a list of line ranges that indicates how the text of each line corresponds to character offsets in the paragraph.

Parameters:
  • lines (List[PageXMLTextLine]) – a list of PageXML text lines belonging to the same text region

  • word_break_chars (str | Set[str] | List[str]) – the characters that signal a word break

  • wbd (WordBreakDetector) – a word break detector object

Returns:

a paragraph of text and a list of line ranges that indicates how the text of each line corresponds to character offsets in the paragraph.

Return type:

Tuple[str | None, List[Dict[str, any]]]
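The relation between the paragraph and its line ranges can be shown on plain strings. This is a minimal sketch, not the library's implementation: the function name and the keys of the range dictionaries are assumptions, and every trailing word-break character is treated as a real break, whereas the actual function can consult a WordBreakDetector to decide.

```python
from typing import Dict, List, Tuple


def join_lines_with_ranges(line_texts: List[str],
                           word_break_chars: str = "-"
                           ) -> Tuple[str, List[Dict[str, int]]]:
    paragraph = ""
    line_ranges: List[Dict[str, int]] = []
    for index, text in enumerate(line_texts):
        is_last = index == len(line_texts) - 1
        if not is_last and text and text[-1] in word_break_chars:
            # Drop the break character so the word continues on the next line.
            piece = text[:-1]
        elif is_last:
            piece = text
        else:
            # Separate complete words on consecutive lines with a space.
            piece = text + " "
        start = len(paragraph)
        paragraph += piece
        line_ranges.append({"line_index": index, "start": start, "end": len(paragraph)})
    return paragraph, line_ranges


paragraph, line_ranges = join_lines_with_ranges(
    ["The quick brown fox jum-", "ps over the", "lazy dog"]
)
# Slicing the paragraph with a range recovers that line's contribution.
second_line_part = paragraph[line_ranges[1]["start"]:line_ranges[1]["end"]]
```

Storing start and end offsets per line makes it possible to map a match found in the paragraph text back to the line, and therefore the page coordinates, it came from.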

pagexml.helper.pagexml_helper.merge_lines(lines: List[PageXMLTextLine], remove_word_break: bool = False, word_break_char: str = '-') -> PageXMLTextLine

Returns a PageXMLTextLine object that merges a list of PageXMLTextLine objects.

Parameters:
  • lines (List[PageXMLTextLine]) – a list of PageXML text lines

  • remove_word_break (bool) – flag indicating whether word-break characters at line ends should be removed

  • word_break_char (str) – the character that marks a word break at the end of a line

Returns:

a PageXML text line object

Return type:

PageXMLTextLine
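As a rough illustration of what merging involves, the sketch below combines both the text and the bounding boxes of simplified line objects. SimpleLine, merge_simple_lines, and the choice to join unbroken lines with a single space are assumptions; the real function works on PageXMLTextLine objects and their coordinates.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SimpleLine:
    """Hypothetical stand-in for a PageXMLTextLine."""
    text: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom)


def merge_simple_lines(lines: List[SimpleLine],
                       remove_word_break: bool = False,
                       word_break_char: str = "-") -> SimpleLine:
    if not lines:
        raise ValueError("need at least one line to merge")
    parts = []
    for index, line in enumerate(lines):
        is_last = index == len(lines) - 1
        if (remove_word_break and not is_last and word_break_char
                and line.text.endswith(word_break_char)):
            # Strip the break character and join directly to the next line.
            parts.append(line.text[:-len(word_break_char)])
        elif is_last:
            parts.append(line.text)
        else:
            parts.append(line.text + " ")
    # The merged box is the smallest box that encloses all input boxes.
    box = (
        min(line.box[0] for line in lines),
        min(line.box[1] for line in lines),
        max(line.box[2] for line in lines),
        max(line.box[3] for line in lines),
    )
    return SimpleLine("".join(parts), box)


parts_to_merge = [
    SimpleLine("word-", (10, 10, 200, 40)),
    SimpleLine("break here", (220, 12, 480, 42)),
]
merged = merge_simple_lines(parts_to_merge, remove_word_break=True)
kept = merge_simple_lines(parts_to_merge, remove_word_break=False)
```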

pagexml.helper.pagexml_helper.merge_sets(sets: List[Set[any]], min_overlap: int = 1) -> List[Set[any]]

pagexml.helper.pagexml_helper.merge_textregions(text_regions: List[PageXMLTextRegion], metadata: dict = None, doc_id: str = None) -> PageXMLTextRegion | None

Merge a list of text regions into one, sorting lines by baseline height.

pagexml.helper.pagexml_helper.pagexml_to_line_format(pagexml_doc: PageXMLTextRegion) -> Generator[Tuple[str, str, str], None, None]

pagexml.helper.pagexml_helper.pretty_print_textregion(text_region: PageXMLTextRegion, reading_direction: str = 'ltr', print_stats: bool = False) -> None

Pretty print the text of a text region, using indentation and vertical space based on the average character width and average distance between lines. If no corresponding images of the PageXML are available, this can serve as a visual approximation to reveal the page layout.

Parameters:
  • text_region (PageXMLTextRegion) – a TextRegion object that contains TextLines

  • reading_direction (str) – the reading direction, either left-to-right ('ltr', the default) or right-to-left

  • print_stats (bool) – flag to print text_region statistics if set to True
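The indentation idea can be sketched in a few lines. In this minimal example each line is a (left, text) pair and the indent is the distance from the leftmost line divided by the average character width. The function render_layout is hypothetical and covers only horizontal indentation for left-to-right text; the vertical spacing based on line distance that the description mentions is left out.

```python
from typing import List, Tuple


def render_layout(lines: List[Tuple[int, str]], avg_char_width: float) -> str:
    """Approximate horizontal layout by indenting each line with spaces."""
    if not lines:
        return ""
    min_left = min(left for left, _ in lines)
    rendered = []
    for left, text in lines:
        # Convert the pixel offset from the left margin to a number of characters.
        indent = int((left - min_left) / avg_char_width)
        rendered.append(" " * indent + text)
    return "\n".join(rendered)


layout = render_layout(
    [(100, "Title"), (60, "body text"), (100, "indented")],
    avg_char_width=20.0,
)
```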

pagexml.helper.pagexml_helper.print_textregion_stats(text_region: PageXMLTextRegion) -> None

Print statistics on the textual content of a text region.

Parameters:

text_region (PageXMLTextRegion) – a TextRegion object that contains TextLines

pagexml.helper.pagexml_helper.read_line_format_file(line_format_files: str | List[str], headers: List[str] = None, has_header: bool = False) -> Generator[Tuple[str, str, str], None, None]

pagexml.helper.pagexml_helper.sort_lines_in_column_reading_order(doc: PageXMLDoc, reading_direction: str = 'ltr') -> Generator[PageXMLTextLine, None, None]

Sort the lines of a pdm.PageXML document in reading order. Reading order is: columns from left to right, text regions in columns from top to bottom, lines in text regions from top to bottom, and when (roughly) adjacent, from left to right.
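The nesting of this ordering (columns, then text regions, then lines) can be sketched with plain dictionaries. The data layout and the function name lines_in_column_order are assumptions for illustration, and the final step of ordering roughly adjacent lines from left to right is omitted for brevity.

```python
from typing import Any, Dict, Iterator, List


def lines_in_column_order(columns: List[Dict[str, Any]]) -> Iterator[str]:
    # Columns from left to right.
    for column in sorted(columns, key=lambda c: c["left"]):
        # Text regions within a column from top to bottom.
        for region in sorted(column["regions"], key=lambda r: r["top"]):
            # Lines within a text region from top to bottom.
            for line in sorted(region["lines"], key=lambda l: l["top"]):
                yield line["text"]


# Hypothetical two-column page, deliberately given out of order.
columns = [
    {"left": 600, "regions": [
        {"top": 50, "lines": [{"top": 60, "text": "c2 r1 l1"}]},
    ]},
    {"left": 50, "regions": [
        {"top": 400, "lines": [{"top": 410, "text": "c1 r2 l1"}]},
        {"top": 50, "lines": [{"top": 90, "text": "c1 r1 l2"},
                              {"top": 60, "text": "c1 r1 l1"}]},
    ]},
]
ordered = list(lines_in_column_order(columns))
```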

pagexml.helper.pagexml_helper.sort_lines_in_reading_direction(lines: List[PageXMLTextLine], reading_direction: str = 'ltr') -> Generator[PageXMLTextLine, None, None]

pagexml.helper.pagexml_helper.sort_lines_in_reading_order(doc: PageXMLTextRegion, row_order: bool = False, reading_direction: str = 'ltr') -> Generator[PageXMLTextLine, None, None]

pagexml.helper.pagexml_helper.sort_lines_in_row_reading_order(doc: PageXMLTextRegion, reading_direction: str = 'ltr') -> Generator[PageXMLTextLine, None, None]

Sort the lines of a pdm.PageXML document in row order. Row order is: lines from top to bottom, and when (roughly) adjacent, in the given reading direction.

pagexml.helper.pagexml_helper.sort_regions_in_reading_order(doc: PageXMLDoc) -> List[PageXMLTextRegion]

Sort text regions in reading order. If an explicit reading order is given, it is used; otherwise, text regions are sorted top to bottom, then left to right.
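The fallback logic can be sketched as follows. Regions are plain dictionaries here, and the explicit reading order is assumed to be a mapping from position to region id; both the data layout and the name order_regions are hypothetical and may not match how a PageXMLDoc stores its reading order.

```python
from typing import Any, Dict, List, Optional


def order_regions(regions: List[Dict[str, Any]],
                  reading_order: Optional[Dict[int, str]] = None
                  ) -> List[Dict[str, Any]]:
    if reading_order:
        # An explicit reading order takes precedence over coordinates.
        by_id = {region["id"]: region for region in regions}
        return [by_id[reading_order[position]]
                for position in sorted(reading_order)
                if reading_order[position] in by_id]
    # Fallback: top to bottom, then left to right.
    return sorted(regions, key=lambda region: (region["top"], region["left"]))


regions = [
    {"id": "r1", "top": 300, "left": 50},
    {"id": "r2", "top": 100, "left": 400},
    {"id": "r3", "top": 100, "left": 50},
]
by_position = [region["id"] for region in order_regions(regions)]
by_explicit_order = [
    region["id"]
    for region in order_regions(regions, {0: "r1", 1: "r3", 2: "r2"})
]
```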

pagexml.helper.pagexml_helper.write_pagexml_to_line_format(pagexml_docs: List[PageXMLTextRegion], output_file: str) -> None