pagexml.analysis.text_stats module

class pagexml.analysis.text_stats.LineAnalyser(word_break_chars: str | Set[str] = '-', ignorecase: bool = False, token_type: str = None)[source]

Bases: object

analyse_line_chars(text_lines: Iterable[any])[source]: Analyse the frequency of characters at the start, middle and end of a text line, for a given list of text lines.

analyse_line_words(text_lines: Iterable[any])[source]

Gather corpus statistics for a list of text lines on words at the start, middle and end of a text line.

Parameters: text_lines (Iterable[any]) – an iterable of text lines (either strings or dictionaries with a ‘text’ property)
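
The idea of position-based word counters can be illustrated with a minimal sketch. It uses only the standard library and is not the library's implementation: the helper name `count_line_positions`, whitespace tokenisation, and the choice to count a single-word line as both start and end are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Union


def count_line_positions(
    text_lines: Iterable[Union[str, Dict[str, str]]]
) -> Dict[str, Counter]:
    """Count words by their position in a line (hypothetical helper)."""
    counters = {name: Counter() for name in ("all", "start", "mid", "end")}
    for line in text_lines:
        # Accept plain strings as well as dictionaries with a 'text' property.
        text = line.get("text") if isinstance(line, dict) else line
        if not text:
            continue
        words = text.split()  # assumption: simple whitespace tokenisation
        if not words:
            continue
        counters["all"].update(words)
        counters["start"][words[0]] += 1
        counters["end"][words[-1]] += 1
        counters["mid"].update(words[1:-1])
    return counters


stats = count_line_positions(["the quick brown fox", {"text": "the lazy dog"}])
print(stats["start"].most_common(1))
```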

get_stats()[source]: Return statistics on the frequency of characters occurring at the start, middle and end of a text line.

num_tokens()[source]: Returns descriptive statistics of the number of tokens per counter.

num_types()[source]: Returns descriptive statistics of the number of types per counter.

reset_counters()[source]: Reset all the counters.

set_stats()[source]

class pagexml.analysis.text_stats.LineCharAnalyser(text_lines: Iterable[any] = None, word_break_chars: str | Set[str] = '-', ignorecase: bool = False)[source]: Bases: LineAnalyser

class pagexml.analysis.text_stats.LineWordAnalyser(text_lines: Iterable[any] = None, word_break_chars: str | Set[str] = '-', ignorecase: bool = False)[source]

Bases: LineAnalyser

analyse_line_word_categories(text_lines: Iterable[str, pdm.PageXMLTextLine, Dict[str, any]], **kwargs) → Dict[str, Counter][source]

Collect counts on the frequency of different word types, e.g. numbers, title words, stopwords, etc. To get counts on stopwords, a stopword list must be passed. For information on what keyword arguments can be passed, see pagexml.analysis.text_stats.get_word_cat_stats.

Parameters: text_lines (Iterable[str, PageXMLTextLine, Dict[str, any]]) – an iterable of text lines

class pagexml.analysis.text_stats.WordBreakDetector(min_bigram_word_freq: int = 5, word_break_chars: str | Set[str] = '-', ignorecase: bool = False, lines: Iterable = None)[source]

Bases: LineWordAnalyser

print_counter_stats()[source]: Print overall statistics on the vocabulary derived from the analysed text lines.

reset_counters()[source]: Reset all the counters.

set_counters(lines: Iterable[any])[source]

Gather corpus statistics for a list of text lines on words at the start, middle and end of a text line, and on word bigrams in the middle of a line.

Parameters: lines (Iterable[any]) – an iterable of text lines (either strings or dictionaries with a ‘text’ property)

pagexml.analysis.text_stats.compute_complement_keyness(target_analyser: LineAnalyser, target_counter: str)[source]

Compute the keyness score of each token in the vocabulary for a given target counter, using its complement as the reference counter (available counters are ‘all’, ‘start’, ‘mid’ or ‘end’). The complement is the ‘all’ counter minus the target counter.

The return value is a dictionary with two properties, ‘less’ and ‘more’, each with a Counter object. The ‘less’ counter contains the log likelihood ratio for tokens that are less common in the target counter than in the reference counter. The ‘more’ counter contains the log likelihood ratio for tokens that are more common in the target counter than in the reference counter.

Parameters:

target_analyser (LineAnalyser) – the target LineAnalyser
target_counter (str) – the counter used for token frequencies of the target corpus (possible values: ‘all’, ‘start’, ‘mid’ or ‘end’)
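
The complement described above can be pictured with plain `collections.Counter` objects. This is a sketch only: the example frequencies are made up, and whether the library builds the complement with Counter subtraction is an assumption.

```python
from collections import Counter

# Made-up token frequencies for illustration only.
all_counter = Counter({"the": 10, "and": 6, "ver-": 3})
end_counter = Counter({"the": 1, "ver-": 3})

# Counter subtraction keeps only tokens with a positive remaining count,
# so a token that occurs exclusively at line ends drops out entirely.
complement = all_counter - end_counter
print(complement)
```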

pagexml.analysis.text_stats.compute_expected(observed: array) → array[source]: Computes the contingency table of the expected values given a contingency table of the observed values.
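
As a rough illustration of the underlying arithmetic, the expected value of each cell in a contingency table is its row total times its column total, divided by the grand total. The sketch below uses nested lists instead of arrays, and the name `expected_from_observed` is hypothetical; it is not the library's code.

```python
from typing import List


def expected_from_observed(observed: List[List[float]]) -> List[List[float]]:
    """Expected cell values under independence (hypothetical helper)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [
        [row_total * col_total / grand_total for col_total in col_totals]
        for row_total in row_totals
    ]


observed = [[30, 70], [10, 90]]
expected = expected_from_observed(observed)
print(expected)
```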

pagexml.analysis.text_stats.compute_keyness(target_counter: Counter, reference_counter: Counter, vocab: Iterable[str] = None)[source]

Compute the keyness score of each token in the vocabulary for a given target counter and reference counter (available counters are ‘all’, ‘start’, ‘mid’ or ‘end’).

The return value is a dictionary with two properties, ‘less’ and ‘more’, each with a Counter object. The ‘less’ counter contains the log likelihood ratio for tokens that are less common in the target counter than in the reference counter. The ‘more’ counter contains the log likelihood ratio for tokens that are more common in the target counter than in the reference counter.

Parameters:

target_counter (Counter) – the counter used for token frequencies of the target corpus (e.g. the ‘all’, ‘start’, ‘mid’ or ‘end’ counter of a LineAnalyser)
reference_counter (Counter) – the counter used for token frequencies of the reference corpus (e.g. the ‘all’, ‘start’, ‘mid’ or ‘end’ counter of a LineAnalyser)
vocab (Iterable[str]) – an optional vocabulary for which to compute keyness values.
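
To make the ‘less’ and ‘more’ counters concrete, here is a minimal sketch of a log likelihood ratio keyness computation. It assumes the common two-cell form of the statistic (expected frequencies derived from the corpus totals) and non-empty counters; the library may compute the score differently, and `keyness_sketch` is a hypothetical name.

```python
import math
from collections import Counter
from typing import Dict


def keyness_sketch(target: Counter, reference: Counter) -> Dict[str, Counter]:
    """Log likelihood ratio per token, split by direction (hypothetical helper)."""
    target_total = sum(target.values())
    reference_total = sum(reference.values())
    combined_total = target_total + reference_total
    keyness = {"more": Counter(), "less": Counter()}
    for token in set(target) | set(reference):
        a, b = target[token], reference[token]
        expected_a = target_total * (a + b) / combined_total
        expected_b = reference_total * (a + b) / combined_total
        llr = 0.0
        # A zero observed frequency contributes nothing to the sum.
        if a > 0:
            llr += a * math.log(a / expected_a)
        if b > 0:
            llr += b * math.log(b / expected_b)
        llr *= 2
        # Tokens with equal relative frequency end up in 'less' with a score of 0.
        direction = "more" if a / target_total > b / reference_total else "less"
        keyness[direction][token] = llr
    return keyness


keyness = keyness_sketch(
    Counter({"de": 8, "ende": 2}), Counter({"de": 2, "ende": 8})
)
print(keyness["more"].most_common(1))
```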

pagexml.analysis.text_stats.compute_log_likelihood(token: str, target_counter: Counter, target_total: int, reference_counter: Counter, reference_total: int) → Tuple[float, str][source]

pagexml.analysis.text_stats.determine_word_break(curr_words: List[str], prev_words: List[str], wbd: WordBreakDetector = None, word_break_chars: str | Set[str] = '-', debug: bool = False) → Tuple[bool, str | None][source]

Determine, for a current line and the previous line (each given as a list of words), whether the previous line ends with a word break.

Parameters:

curr_words (List[str]) – a list of words for the current line to be merged with the previous line
prev_words (List[str]) – a list of words for the previous line to be merged with the current line
wbd (WordBreakDetector) – a word break detector object
word_break_chars (str | Set[str]) – the characters that can occur as word breaks
debug (bool) – whether to print debugging information

Returns:

a flag indicating whether the previous line ends in a word break, and the merged word composed of the previous line’s last word and the current line’s first word (or None if the words should not be merged)

Return type:

Tuple[bool, Union[str, None]]
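
The shape of the return value can be illustrated with a deliberately simplified stand-in. The sketch below replaces the WordBreakDetector with a plain set of known words and applies a single rule (the previous line ends in a word-break character and the merged word is known). The real function combines several frequency-based checks, so `simple_word_break` and its `vocabulary` parameter are assumptions for illustration only.

```python
from typing import List, Optional, Set, Tuple


def simple_word_break(
    curr_words: List[str],
    prev_words: List[str],
    vocabulary: Set[str],
    word_break_chars: str = "-",
) -> Tuple[bool, Optional[str]]:
    """Single-rule word break check (hypothetical, much simpler than the library)."""
    if not curr_words or not prev_words:
        return False, None
    end_word, start_word = prev_words[-1], curr_words[0]
    # The previous line must end in a word-break character attached to a word.
    if len(end_word) < 2 or end_word[-1] not in word_break_chars:
        return False, None
    merged = end_word[:-1] + start_word
    if merged in vocabulary:
        return True, merged
    return False, None


vocabulary = {"document"}
broken = simple_word_break(["ment", "is"], ["the", "docu-"], vocabulary)
not_broken = simple_word_break(["known"], ["well-"], vocabulary)
print(broken, not_broken)
```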

pagexml.analysis.text_stats.determine_word_break_typical_merge_end(wbd: WordBreakDetector, end_word: str, start_word: str, merge_word: str) → bool[source]

pagexml.analysis.text_stats.end_is_common_word(wbd: WordBreakDetector, end_word: str, common_freq: int = 100, debug: bool = False) → bool[source]

pagexml.analysis.text_stats.end_start_are_bigram(wbd: WordBreakDetector, merge_word: str, bigram_freq: int, factor: int = 5) → bool[source]

pagexml.analysis.text_stats.end_start_are_hyphenated_compound(wbd: WordBreakDetector, end_word: str, start_word: str, merge_word: str) → bool[source]

pagexml.analysis.text_stats.get_doc_words(pagexml_doc: PageXMLTextRegion, use_re_word_boundaries: bool = False) → List[str][source]

Return a list of words that are part of a PageXML pagexml_doc object.

Parameters:

pagexml_doc (PageXMLTextRegion) – a PageXML document object
use_re_word_boundaries (bool) – whether to split words of a line using RegEx word boundaries

Returns:

a list of all words on a pagexml_doc

Return type:

List[str]
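
The difference between the two splitting modes can be sketched as follows. How the library applies RegEx word boundaries is not specified here, so the use of `re.split` on `\b` and the helper name `split_words` are assumptions.

```python
import re
from typing import List


def split_words(text: str, use_re_word_boundaries: bool = False) -> List[str]:
    """Split a line into words (hypothetical helper)."""
    if use_re_word_boundaries:
        # Splitting on \b separates punctuation from adjacent words;
        # whitespace-only fragments are discarded.
        return [tok.strip() for tok in re.split(r"\b", text) if tok.strip()]
    return text.split()


by_whitespace = split_words("Hello, world.")
by_boundaries = split_words("Hello, world.", use_re_word_boundaries=True)
print(by_whitespace, by_boundaries)
```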

pagexml.analysis.text_stats.get_keyness_vocab(target_counter: Counter, reference_counter: Counter) → Set[str][source]

pagexml.analysis.text_stats.get_line_text(text_line: str | Dict[str, any]) → str | None[source]: Convenience function to return the text string of a text line, regardless of whether text_line is a str, or a dictionary or a NoneType.
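
A minimal sketch of this kind of convenience function, assuming only the three input types named above (the handling of other types, and the name `line_text_sketch`, are assumptions):

```python
from typing import Dict, Optional, Union


def line_text_sketch(
    text_line: Union[str, Dict[str, object], None]
) -> Optional[str]:
    """Return the text of a line given as str, dict or None (hypothetical helper)."""
    if text_line is None:
        return None
    if isinstance(text_line, str):
        return text_line
    if isinstance(text_line, dict):
        # A missing 'text' property yields None rather than raising.
        return text_line.get("text")
    # Assumption: anything else is treated as an error in this sketch.
    raise TypeError(f"unsupported text line type: {type(text_line).__name__}")


print(line_text_sketch({"text": "a line"}))
```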

pagexml.analysis.text_stats.get_observed(token: str, target_counter: Counter, target_total: int, reference_counter: Counter, reference_total: int)[source]: Computes the contingency table of the observed values given a target token, and target and reference analysers and counters.
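
One common layout for such a table is 2×2: one row per corpus, with the token's frequency in the first column and the frequency of all other tokens in the second. The sketch below assumes that layout and uses nested lists; the library's actual table layout is not documented here, and `observed_table` is a hypothetical name.

```python
from collections import Counter
from typing import List


def observed_table(
    token: str,
    target_counter: Counter,
    target_total: int,
    reference_counter: Counter,
    reference_total: int,
) -> List[List[int]]:
    """2x2 table of observed frequencies (hypothetical layout)."""
    t = target_counter[token]  # a Counter returns 0 for unseen tokens
    r = reference_counter[token]
    return [[t, target_total - t], [r, reference_total - r]]


target = Counter({"the": 30, "cat": 70})
reference = Counter({"the": 10, "dog": 90})
table = observed_table("the", target, 100, reference, 100)
print(table)
```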

pagexml.analysis.text_stats.get_typical_start_end_words(wbd: WordBreakDetector, threshold: float = 0.5) → Tuple[Set[str], Set[str]][source]

pagexml.analysis.text_stats.get_word_cat_stats(words, stop_words=None, max_word_length: int = 30, word_length_bin_size: int = 5)[source]

Calculate word type statistics for the words of a given PageXML scan.

Parameters:

words (List[str]) – a list of words on a scan
stop_words (List[str]) – a list of stopwords
max_word_length (int (default 30 characters)) – the maximum length of words to be considered a regular word
word_length_bin_size (int (default per 5 characters)) – bin size for grouping words within a character length interval
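
The role of the parameters above can be sketched as follows. The category names in the returned Counter are invented for this example and are unlikely to match the keys the library actually uses; the sketch only shows how a stopword list, a maximum word length and a length bin size could interact.

```python
from collections import Counter
from typing import Iterable, Optional


def word_cat_sketch(
    words: Iterable[str],
    stop_words: Optional[Iterable[str]] = None,
    max_word_length: int = 30,
    word_length_bin_size: int = 5,
) -> Counter:
    """Count simple word categories (hypothetical helper and key names)."""
    stop_set = set(stop_words or [])
    stats = Counter()
    for word in words:
        if not word:
            continue
        stats["num_words"] += 1
        if word.isdigit():
            stats["num_number_words"] += 1
        if word.istitle():
            stats["num_title_words"] += 1
        if word.lower() in stop_set:
            stats["num_stop_words"] += 1
        if len(word) > max_word_length:
            stats["num_oversized_words"] += 1
        else:
            # Round the length up to the upper edge of its bin.
            upper = -(-len(word) // word_length_bin_size) * word_length_bin_size
            lower = upper - word_length_bin_size + 1
            stats[f"length_{lower}_to_{upper}"] += 1
    return stats


cat_stats = word_cat_sketch(
    ["The", "year", "1642", "in", "Amsterdam"], stop_words=["the", "in"]
)
print(cat_stats)
```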

pagexml.analysis.text_stats.get_words_per_line(lines: List[PageXMLTextLine], use_re_word_boundaries: bool = False)[source]

Return a Counter of the number of words per line for a list of PageXMLTextLine objects.

Parameters:

lines (List[PageXMLTextLine]) – a list of PageXMLTextLine objects
use_re_word_boundaries (bool) – whether to split words of a line using RegEx word boundaries

Returns:

a counter of the number of words per line of a pagexml_doc

Return type:

Counter
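
One plausible reading of the returned Counter is a distribution of line lengths: keys are word counts, values are the number of lines with that many words. The sketch below follows that reading with plain strings instead of PageXMLTextLine objects and whitespace splitting only; both choices, and the name `words_per_line_sketch`, are assumptions.

```python
from collections import Counter
from typing import Iterable, Optional


def words_per_line_sketch(line_texts: Iterable[Optional[str]]) -> Counter:
    """Map number of words to number of lines with that length (hypothetical)."""
    return Counter(
        len(text.split()) for text in line_texts if text is not None
    )


counts = words_per_line_sketch(["a b c", "d e f", "g", "", None])
print(counts)
```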

pagexml.analysis.text_stats.has_common_merge_end(wbd: WordBreakDetector, end_word: str, start_word: str) → bool[source]

pagexml.analysis.text_stats.has_non_merge_word(wbd: WordBreakDetector, end_word: str, start_word: str, debug: bool = False) → bool[source]

pagexml.analysis.text_stats.has_word_break_symbol(wbd, end_word, start_word, merge_word)[source]

pagexml.analysis.text_stats.is_non_mid_word(wbd: WordBreakDetector, word: str, factor: int = 5) → bool[source]

pagexml.analysis.text_stats.make_line_analyser(token_type: str, word_break_chars, ignorecase: bool = False)[source]

pagexml.analysis.text_stats.merge_analysers(line_analysers: List[LineAnalyser]) → LineAnalyser[source]: Merge a list of LineAnalyser objects into a new, single LineAnalyser.
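
Conceptually, merging amounts to summing the corresponding counters. The sketch below represents each analyser as a dictionary of named Counter objects, which is an assumption about the internal structure; `merge_counter_sets` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List


def merge_counter_sets(
    counter_sets: List[Dict[str, Counter]]
) -> Dict[str, Counter]:
    """Sum same-named counters into a new set, leaving the inputs untouched."""
    merged: Dict[str, Counter] = {}
    for counter_set in counter_sets:
        for name, counter in counter_set.items():
            merged.setdefault(name, Counter()).update(counter)
    return merged


first = {"start": Counter({"the": 2}), "end": Counter({"dog": 1})}
second = {"start": Counter({"the": 1, "a": 4})}
merged = merge_counter_sets([first, second])
print(merged)
```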

pagexml.analysis.text_stats.merge_is_more_common(wbd, end_word, start_word, merge_word)[source]

pagexml.analysis.text_stats.show_word_break_context(wbd: WordBreakDetector, end_word: str, start_word: str, merge_word: str, match: str = None)[source]

pagexml.analysis.text_stats.start_is_titleword(start_word: str) → bool[source]

pagexml.analysis.text_stats.start_word_has_incorrect_titlecase(wbd: WordBreakDetector, end_word: str, start_word: str, factor: int = 10) → bool[source]