pagexml.analysis.stats module
- pagexml.analysis.stats.derive_boundary_points(pagexml_doc: PageXMLTextRegion) → List[int]
- pagexml.analysis.stats.get_doc_stats(pagexml_docs: PageXMLTextRegion | List[PageXMLTextRegion], line_width_boundary_points: List[int] = None, stop_words: List[str] = None, max_word_length: int = 30, doc_num: int = None, use_re_word_boundaries: bool = False, line_bin_width: int = 300, max_bin: int = 3000) → Dict[str, List[any]]
Generate basic statistics for a PageXML scan object (number of text regions, lines, words, etc.).
Line widths are categorised using a list of boundary points that determine the width of each bin. If no boundary points are passed, a set of boundary points is generated based on the width of the PageXML document.
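The binning idea described above can be illustrated with a small standalone sketch. It does not reproduce the library's implementation: the helper name `categorise_line_widths`, the rule that a line belongs to the first bin whose boundary is at least its width, and the handling of lines wider than the last boundary are all assumptions made for illustration.

```python
from bisect import bisect_left
from typing import Dict, List


def categorise_line_widths(line_widths: List[int],
                           boundary_points: List[int]) -> Dict[int, int]:
    """Count lines per width bin (hypothetical helper, not part of pagexml)."""
    if not boundary_points:
        raise ValueError("at least one boundary point is required")
    points = sorted(boundary_points)
    # Start every bin at zero so empty bins are still reported.
    counts = {point: 0 for point in points}
    for width in line_widths:
        # Assumption: a line goes into the first bin whose boundary is >= its
        # width; lines wider than the last boundary go into the last bin.
        index = min(bisect_left(points, width), len(points) - 1)
        counts[points[index]] += 1
    return counts


# Boundaries every 300 units up to 3000, mirroring the defaults
# line_bin_width=300 and max_bin=3000 from the signature.
boundaries = list(range(300, 3001, 300))
bin_counts = categorise_line_widths([120, 300, 310, 2950, 4000], boundaries)
```

Here the widths 120 and 300 are counted under the 300 bin, 310 under 600, and both 2950 and 4000 under 3000.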
- Parameters:
pagexml_docs (PageXMLTextRegion | List[PageXMLTextRegion]) – a PageXML document object or a list of PageXML document objects
line_width_boundary_points (List[int]) – a list of points indicating boundaries between categories of line widths
stop_words (List[str]) – a list of stopwords to count for the number of stopwords in the scan statistics
max_word_length (int) – max word length above which words are considered oversized
doc_num (int) – the number of a doc in a sequence of docs
use_re_word_boundaries (bool) – whether to use regular-expression word boundaries when counting words
line_bin_width (int) – width of line bins, to aggregate lines of different lengths
max_bin (int) – max line width bin
- Returns:
a dictionary with scan statistics
- Return type:
Dict[str, List[any]]
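To make the word-related parameters more concrete, the following sketch shows one plausible way `stop_words`, `max_word_length` and `use_re_word_boundaries` could affect word counts. It is an illustration under stated assumptions, not the library's code: the function `word_stats`, the dictionary keys it returns, the case-insensitive stopword match and the pattern `\b\w+\b` are all hypothetical.

```python
import re
from typing import Dict, List, Optional


def word_stats(lines: List[str],
               stop_words: Optional[List[str]] = None,
               max_word_length: int = 30,
               use_re_word_boundaries: bool = False) -> Dict[str, int]:
    """Count words, stopwords and oversized words (hypothetical helper)."""
    # Assumption: stopwords are matched case-insensitively.
    stop_set = {word.lower() for word in (stop_words or [])}
    stats = {"words": 0, "stop_words": 0, "oversized_words": 0}
    for line in lines:
        if use_re_word_boundaries:
            # Regex word boundaries split on punctuation such as hyphens.
            words = re.findall(r"\b\w+\b", line)
        else:
            # Plain whitespace splitting keeps punctuation attached to words.
            words = line.split()
        for word in words:
            stats["words"] += 1
            if word.lower() in stop_set:
                stats["stop_words"] += 1
            if len(word) > max_word_length:
                stats["oversized_words"] += 1
    return stats


sample_lines = ["the well-known city of Leiden,", "x" * 35]
plain = word_stats(sample_lines, stop_words=["the", "of"])
regex = word_stats(sample_lines, stop_words=["the", "of"],
                   use_re_word_boundaries=True)
```

With whitespace splitting, `well-known` counts as one word; with regex word boundaries it counts as two, so the total word count differs between the two calls while the stopword and oversized-word counts stay the same.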