pagexml.analysis.stats module

pagexml.analysis.stats.derive_boundary_points(pagexml_doc: PageXMLTextRegion) → List[int]

pagexml.analysis.stats.get_doc_stats(pagexml_docs: PageXMLTextRegion | List[PageXMLTextRegion], line_width_boundary_points: List[int] = None, stop_words: List[str] = None, max_word_length: int = 30, doc_num: int = None, use_re_word_boundaries: bool = False, line_bin_width: int = 300, max_bin: int = 3000) → Dict[str, List[any]]

Generate basic statistics for a PageXML scan object (number of text regions, lines, words, etc.).

Line widths are categorised based on a list of boundary points that determine the width of each bin. If no boundary points are passed, a set of boundary points is generated from the width of the PageXML document.
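To make the idea of boundary points concrete, the following is a minimal sketch of how line widths could be sorted into bins. It is not the library's implementation: the function name categorise_line_widths and the bin labels are hypothetical, and it assumes a width equal to a boundary point belongs to the bin that ends at that point.

```python
from bisect import bisect_left
from collections import Counter
from typing import Dict, List


def categorise_line_widths(line_widths: List[int],
                           boundary_points: List[int]) -> Dict[str, int]:
    """Count how many line widths fall into each bin delimited by boundary_points.

    Hypothetical helper for illustration only; labels are an assumption.
    """
    if not boundary_points:
        raise ValueError("boundary_points must contain at least one value")
    points = sorted(boundary_points)
    counts: Counter = Counter()
    for width in line_widths:
        # Index of the first boundary point that is >= width.
        idx = bisect_left(points, width)
        if idx == len(points):
            # Wider than the largest boundary point.
            label = f"> {points[-1]}"
        else:
            label = f"<= {points[idx]}"
        counts[label] += 1
    return dict(counts)


widths = [120, 480, 950, 1010, 2600]
bins = categorise_line_widths(widths, [500, 1000, 2000])
print(bins)
```

With three boundary points there are four bins: one below each point and one for everything wider than the last point.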

Parameters:

pagexml_docs (PageXMLTextRegion | List[PageXMLTextRegion]) – a PageXML document object or a list of PageXML document objects
line_width_boundary_points (List[int]) – a list of points indicating boundaries between categories of line widths
stop_words (List[str]) – a list of stopwords, used to count the number of stopwords reported in the scan statistics
max_word_length (int) – max word length above which words are considered oversized
doc_num (int) – the number of a doc in a sequence of docs
use_re_word_boundaries (bool) – flag whether to use RegEx word boundaries for word count
line_bin_width (int) – width of line bins, to aggregate lines of different lengths
max_bin (int) – max line width bin
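As a rough illustration of what the word-level and bin-width parameters might control, here is a sketch under stated assumptions. The helpers count_words and width_to_bin and the dictionary keys are hypothetical and do not come from pagexml. The sketch assumes that use_re_word_boundaries switches from whitespace splitting to a RegEx word-boundary pattern, and that line widths are rounded down to a multiple of line_bin_width and capped at max_bin.

```python
import re
from typing import Dict, List


def count_words(lines: List[str], stop_words: List[str],
                max_word_length: int = 30,
                use_re_word_boundaries: bool = False) -> Dict[str, int]:
    """Count words, stopwords and oversized words (hypothetical helper)."""
    stop_set = {word.lower() for word in stop_words}
    stats = {"words": 0, "stop_words": 0, "oversized_words": 0}
    for line in lines:
        if use_re_word_boundaries:
            # Tokens are runs of word characters between word boundaries.
            words = re.findall(r"\b\w+\b", line)
        else:
            # Tokens are whitespace-separated strings, punctuation included.
            words = line.split()
        for word in words:
            stats["words"] += 1
            if word.lower() in stop_set:
                stats["stop_words"] += 1
            if len(word) > max_word_length:
                stats["oversized_words"] += 1
    return stats


def width_to_bin(width: int, line_bin_width: int = 300,
                 max_bin: int = 3000) -> int:
    """Round a width down to its bin start and cap it at max_bin (assumption)."""
    return min((width // line_bin_width) * line_bin_width, max_bin)


lines = ["the cat,the hat", "supercalifragilistic word"]
split_stats = count_words(lines, ["the"], max_word_length=10)
regex_stats = count_words(lines, ["the"], max_word_length=10,
                          use_re_word_boundaries=True)
print(split_stats, regex_stats)
print(width_to_bin(950), width_to_bin(3400))
```

In this sketch the two tokenisation modes differ on "cat,the": whitespace splitting keeps it as one token, while the RegEx pattern yields "cat" and "the", which also changes the stopword count.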

Returns:

a dictionary with scan statistics

Return type:

Dict[str, List[any]]