pagexml.helper.text_helper module
- class pagexml.helper.text_helper.LineReader(pagexml_files: str | List[str] = None, pagexml_docs: PageXMLDoc | List[PageXMLDoc] = None, pagexml_line_files: str | List[str] = None, line_file_headers: List[str] = None, has_headers: bool = True, use_outer_textregions: bool = False, add_bounding_box: bool = False, groupby: str = None)
Bases: Iterable
- pagexml.helper.text_helper.find_term_in_context(term: str, line_reader: LineReader, max_hits: int = -1, context_size: int = 3, ignorecase: bool = True) → Generator[str, None, None] | None
Find a term and its context in text lines from a line reader iterable. The term can include a wildcard symbol at the start of the term, at the end, or both.
- Parameters:
term (str) – the term to find in the text lines
line_reader (LineReader) – an iterable over text lines
max_hits (int) – the maximum number of term matches to return
context_size (int) – the number of words before and after each term to return as context
ignorecase (bool) – flag to indicate whether case should be ignored
- Returns:
a generator yielding occurrences of the term with its context
- Return type:
Generator[str, None, None]
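The term-in-context idea described above can be illustrated on plain strings. The following is only a minimal sketch, not the library's implementation: the function name `find_term_in_context_sketch` is hypothetical, it takes a list of strings rather than a LineReader, and it assumes that `*` is the wildcard symbol and that words are separated by whitespace.

```python
import re
from typing import Generator, Iterable


def find_term_in_context_sketch(term: str, lines: Iterable[str], max_hits: int = -1,
                                context_size: int = 3,
                                ignorecase: bool = True) -> Generator[str, None, None]:
    """Yield each match of term with up to context_size words on either side."""
    # Assumption: a leading or trailing '*' stands for any run of word characters.
    core = re.escape(term.strip('*'))
    prefix = r'\w*' if term.startswith('*') else ''
    suffix = r'\w*' if term.endswith('*') else ''
    pattern = re.compile(prefix + core + suffix, re.IGNORECASE if ignorecase else 0)
    hits = 0
    for line in lines:
        words = line.split()
        for i, word in enumerate(words):
            # Strip surrounding punctuation before matching the whole word.
            if pattern.fullmatch(word.strip('.,;:!?()"\'')) is None:
                continue
            start = max(0, i - context_size)
            yield ' '.join(words[start:i + context_size + 1])
            hits += 1
            # The hit count is always positive, so max_hits=-1 never stops the search.
            if hits == max_hits:
                return


lines = ["The quick brown fox jumps over the lazy dog",
         "A Foxhound ran after the fox."]
for hit in find_term_in_context_sketch("fox*", lines, context_size=2):
    print(hit)
```

Because the sketch is a generator, a caller can stop consuming hits early without scanning the remaining lines.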
- pagexml.helper.text_helper.get_bbox(doc: PageXMLDoc)
- pagexml.helper.text_helper.get_line_format_json(page_doc: PageXMLTextRegion, use_outer_textregions: bool = False, add_bounding_box: bool = False) → Generator[Dict[str, any], None, None]
- pagexml.helper.text_helper.get_line_format_tsv(page_doc: PageXMLTextRegion, headers: List[str], use_outer_textregions: bool = False, add_bounding_box: bool = False) → Generator[List[str], None, None]
- pagexml.helper.text_helper.get_line_words(line: PageXMLTextLine | str, word_break_chars: str | Set[str] = '-') → List[str]
Return a list of the words for a given line.
- Parameters:
line (Union[str, PageXMLTextLine]) – a line of text (string or PageXMLTextLine)
word_break_chars (Union[str, Set[str]]) – one or more word break characters, given as a string or a set
- Returns:
a list of words
- Return type:
List[str]
- pagexml.helper.text_helper.get_page_lines_words(page: PageXMLPage, word_break_chars='-') → Generator[List[str], None, None]
Return a generator object yielding lists of words per line of a PageXML Page.
- Parameters:
page (PageXMLPage) – a PageXML page object
word_break_chars (str) – a string of one or more word break characters
- Returns:
a generator object yielding a list of words per page line
- Return type:
Generator[List[str], None, None]
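As an illustration of yielding a list of words per line, here is a small sketch that works on plain strings instead of PageXML objects. The name `iter_line_words_sketch` is hypothetical, and the tokenisation rule is an assumption: a word is a run of word characters that may contain, or end in, one of the word break characters. The library's own tokenisation may differ.

```python
import re
from typing import Generator, Iterable, List


def iter_line_words_sketch(lines: Iterable[str],
                           word_break_chars: str = '-') -> Generator[List[str], None, None]:
    """Yield one list of words per line, keeping a trailing word break character."""
    # Assumption: word_break_chars is a non-empty string.
    breaks = re.escape(word_break_chars)
    # A word may contain break characters (e.g. 'ge-opend') and may end in one
    # (e.g. 'ver-' at the end of a line).
    token = re.compile(rf'\w+(?:[{breaks}]\w+)*[{breaks}]?')
    for line in lines:
        yield token.findall(line)


for words in iter_line_words_sketch(["de ver-", "gadering is ge-opend."]):
    print(words)
```

Keeping the trailing break character on the last word lets a later step decide whether that word continues on the next line.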
- pagexml.helper.text_helper.make_line_format_file(page_docs: Iterable[PageXMLTextRegion], line_format_file: str, headers: List[str] = None, use_outer_textregions: bool = False, add_bounding_box: bool = False)
Create a line format file for a list of PageXMLDoc objects.
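To make the notion of a line format file more concrete, the sketch below writes one tab-separated row per text line with a header row, using only the standard library. It is not the library's code: the function name `write_line_format_sketch` and the column names `doc_id`, `line_id` and `text` are hypothetical, and the tab-separated layout is an assumption based on the `headers` parameter and the TSV helper listed above.

```python
import csv
from typing import Dict, Iterable, List


def write_line_format_sketch(rows: Iterable[Dict[str, str]], line_format_file: str,
                             headers: List[str]) -> None:
    """Write one tab-separated row per text line, preceded by a header row."""
    with open(line_format_file, 'wt', encoding='utf-8', newline='') as fh:
        # Keys that are not listed in headers are silently dropped.
        writer = csv.DictWriter(fh, fieldnames=headers, delimiter='\t',
                                extrasaction='ignore')
        writer.writeheader()
        writer.writerows(rows)
```

A call such as `write_line_format_sketch(rows, 'lines.tsv', ['doc_id', 'line_id', 'text'])` would produce a file in which each row carries the identifier of the document it came from.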
- pagexml.helper.text_helper.make_skipgram_similarity_dict(line_reader: LineReader, ngram_length: int = 2, skip_length: int = 1) → SkipgramSimilarity
- pagexml.helper.text_helper.read_lines_from_line_files(pagexml_line_files: str | List[str], has_headers: bool = True) → Generator[str, None, None]
- pagexml.helper.text_helper.read_pagexml_docs_from_line_file(line_files: str | List[str], has_headers: bool = True, headers: List[str] = None, add_bounding_box: bool = True) → Generator[PageXMLTextRegion, None, None]
Read lines from one or more PageXML line format files and return them as PageXMLTextLine objects, grouped by their PageXML document.
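The grouping step can be pictured with a standard-library sketch that reads a tab-separated file and yields its rows per document. This is an illustration under assumptions, not the library's implementation: the function name `read_grouped_lines_sketch` and the `doc_id` column are hypothetical, the file is assumed to have a header row, and plain dictionaries stand in for PageXML objects.

```python
import csv
from itertools import groupby
from typing import Dict, Generator, List, Tuple


def read_grouped_lines_sketch(line_file: str, doc_column: str = 'doc_id'
                              ) -> Generator[Tuple[str, List[Dict[str, str]]], None, None]:
    """Yield (document id, rows) pairs from a tab-separated file with a header row."""
    with open(line_file, 'rt', encoding='utf-8', newline='') as fh:
        reader = csv.DictReader(fh, delimiter='\t')
        # groupby only merges adjacent rows, so the lines of one document
        # are assumed to be stored contiguously in the file.
        for doc_id, rows in groupby(reader, key=lambda row: row[doc_column]):
            yield doc_id, list(rows)
```

Reading the rows lazily and grouping adjacent ones keeps at most one document's lines in memory at a time, which matters for large line files.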
- pagexml.helper.text_helper.remove_word_break_chars(end_word: str, start_word: str, word_break_chars='-=:') → str
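The signature suggests that a word broken across two lines is rejoined by dropping the break character. The sketch below is a guess at that idea and not the library's implementation: the function name `merge_broken_word_sketch` is hypothetical, and it assumes that at most one trailing break character is removed from the first half before the two halves are concatenated.

```python
def merge_broken_word_sketch(end_word: str, start_word: str,
                             word_break_chars: str = '-=:') -> str:
    """Join the last word of one line with the first word of the next line."""
    # Drop a single trailing break character from the first half, if present.
    if end_word and end_word[-1] in word_break_chars:
        end_word = end_word[:-1]
    return end_word + start_word


print(merge_broken_word_sketch('ver-', 'gadering'))
```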