pagexml.helper.text_helper module

class pagexml.helper.text_helper.LineReader(pagexml_files: str | List[str] = None, pagexml_docs: PageXMLDoc | List[PageXMLDoc] = None, pagexml_line_files: str | List[str] = None, line_file_headers: List[str] = None, has_headers: bool = True, use_outer_textregions: bool = False, add_bounding_box: bool = False, groupby: str = None)[source]: Bases: Iterable

pagexml.helper.text_helper.find_term_in_context(term: str, line_reader: LineReader, max_hits: int = -1, context_size: int = 3, ignorecase: bool = True) → Generator[str, None, None] | None[source]

Find a term and its context in text lines from a line reader iterable. The term can include a wildcard symbol at the start of the term, at the end, or both.

Parameters:

term (str) – a term to find in a list of lines
line_reader (LineReader) – an iterable for a list of lines
max_hits (int) – the maximum number of term matches to return
context_size (int) – the number of words before and after each term to return as context
ignorecase (bool) – flag to indicate whether case should be ignored

Returns:

a generator yielding occurrences of the term with their context

Return type:

Generator[str, None, None]
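
The behaviour described above can be illustrated with a minimal standalone sketch over plain strings. It is not the library's implementation: the names compile_term and find_in_context are hypothetical, the wildcard symbol is assumed to be "*" (the description does not name it), a negative max_hits is assumed to mean "no limit", and matching is done per whitespace-separated word after stripping surrounding punctuation.

```python
import re
import string
from typing import Generator, Iterable

WILDCARD = "*"  # assumption: the documentation does not name the wildcard symbol


def compile_term(term: str, ignorecase: bool = True) -> "re.Pattern[str]":
    """Turn a term with an optional leading/trailing wildcard into a regex."""
    prefix = r"\w*" if term.startswith(WILDCARD) else ""
    suffix = r"\w*" if term.endswith(WILDCARD) else ""
    core = re.escape(term.strip(WILDCARD))
    return re.compile(prefix + core + suffix, re.IGNORECASE if ignorecase else 0)


def find_in_context(term: str, lines: Iterable[str], max_hits: int = -1,
                    context_size: int = 3,
                    ignorecase: bool = True) -> Generator[str, None, None]:
    """Yield each matching word with context_size words on either side."""
    pattern = compile_term(term, ignorecase)
    hits = 0
    for line in lines:
        words = line.split()
        for i, word in enumerate(words):
            # Compare the bare word so that "fox." still matches "fox".
            if not pattern.fullmatch(word.strip(string.punctuation)):
                continue
            # Assumption: a negative max_hits means "no limit".
            if 0 <= max_hits <= hits:
                return
            start = max(0, i - context_size)
            yield " ".join(words[start:i + context_size + 1])
            hits += 1


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog",
        "A Foxhound chased the fox.",
    ]
    for hit in find_in_context("fox*", sample, context_size=2):
        print(hit)
```

Because the function is a generator, hits are produced lazily, so a caller can stop iterating early without scanning the remaining lines.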

pagexml.helper.text_helper.get_bbox(doc: PageXMLDoc)[source]

pagexml.helper.text_helper.get_line_format_json(page_doc: PageXMLTextRegion, use_outer_textregions: bool = False, add_bounding_box: bool = False) → Generator[Dict[str, any], None, None][source]

pagexml.helper.text_helper.get_line_format_tsv(page_doc: PageXMLTextRegion, headers: List[str], use_outer_textregions: bool = False, add_bounding_box: bool = False) → Generator[List[str], None, None][source]

pagexml.helper.text_helper.get_line_words(line: PageXMLTextLine | str, word_break_chars: str | Set[str] = '-') → List[str][source]

Return a list of the words for a given line.

Parameters:

line (Union[str, PageXMLTextLine]) – a line of text (string or PageXMLTextLine)
word_break_chars (Union[str, Set[str]]) – one or more characters that mark a word break at the end of a line

Returns:

a list of words

Return type:

List[str]

pagexml.helper.text_helper.get_page_lines_words(page: PageXMLPage, word_break_chars='-') → Generator[List[str], None, None][source]

Return a generator object yielding lists of words per line of a PageXML Page.

Parameters:

page (PageXMLPage) – a PageXML page object
word_break_chars (str) – one or more characters that mark a word break at the end of a line

Returns:

a generator object yielding a list of words per page line

Return type:

Generator[List[str], None, None]
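
The generator pattern described above (one list of words per line) can be sketched without the library. This is an illustration under stated assumptions, not the library's code: iter_line_words and ends_with_word_break are hypothetical names, lines are plain strings rather than PageXML objects, words are split on whitespace, and a word break character is assumed to be a trailing character on the last word of a line.

```python
from typing import Generator, Iterable, List, Set, Union


def iter_line_words(lines: Iterable[str]) -> Generator[List[str], None, None]:
    """Yield one list of whitespace-separated words per input line."""
    for line in lines:
        # An empty line yields an empty list, so output stays aligned with input.
        yield line.split()


def ends_with_word_break(words: List[str],
                         word_break_chars: Union[str, Set[str]] = "-") -> bool:
    """Assumed check: does the last word end in one of the word break characters?"""
    if not words:
        return False
    last = words[-1]
    return len(last) > 1 and last[-1] in word_break_chars


if __name__ == "__main__":
    page_lines = ["Dit is een voor-", "beeld van tekst", ""]
    for words in iter_line_words(page_lines):
        print(words, ends_with_word_break(words))
```

Yielding per line rather than building one list for the whole page keeps memory use flat when many pages are processed in sequence.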

pagexml.helper.text_helper.make_line_format_file(page_docs: Iterable[PageXMLTextRegion], line_format_file: str, headers: List[str] = None, use_outer_textregions: bool = False, add_bounding_box: bool = False)[source]: Create a line format file for a list of PageXMLDoc objects.

pagexml.helper.text_helper.make_list(var) → list[source]

pagexml.helper.text_helper.make_skipgram_similarity_dict(line_reader: LineReader, ngram_length: int = 2, skip_length: int = 1) → SkipgramSimilarity[source]

pagexml.helper.text_helper.read_lines_from_line_files(pagexml_line_files: str | List[str], has_headers: bool = True) → Generator[str, None, None][source]

pagexml.helper.text_helper.read_pagexml_docs_from_line_file(line_files: str | List[str], has_headers: bool = True, headers: List[str] = None, add_bounding_box: bool = True) → Generator[PageXMLTextRegion, None, None][source]: Read lines from one or more PageXML line format files and return them as PageXMLTextLine objects, grouped by their PageXML document.

pagexml.helper.text_helper.remove_hyphen(word: str) → str[source]

pagexml.helper.text_helper.remove_word_break_chars(end_word: str, start_word: str, word_break_chars='-=:') → str[source]

pagexml.helper.text_helper.split_line_words(words: List[str]) → Tuple[List[str], List[str], List[str]][source]

pagexml.helper.text_helper.transform_box_to_coords(box_string: str) → Coords[source]