pagexml.analysis.layout_stats module
- pagexml.analysis.layout_stats.average_baseline_height(line: PageXMLTextLine | List[PageXMLTextLine]) -> int
Compute the average (mean) baseline height for comparing lines that are not horizontally aligned.
- Parameters:
line (PageXMLTextLine | List[PageXMLTextLine]) – a TextLine or a list of adjacent lines
- Returns:
the average (mean) baseline height across all its baseline points
- Return type:
int
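A minimal sketch of the calculation described above, using plain lists of (x, y) tuples in place of PageXMLTextLine objects. The helper name `mean_baseline_height` and the truncation to int are assumptions for illustration, not the library's implementation.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def mean_baseline_height(baselines: List[List[Point]]) -> int:
    """Mean y coordinate over all points of one or more baselines (hypothetical helper)."""
    ys = [y for baseline in baselines for _, y in baseline]
    if not ys:
        raise ValueError("no baseline points given")
    # Assumption: the result is truncated to an int.
    return int(sum(ys) / len(ys))


tilted = [(0, 100), (200, 110), (400, 120)]  # a slanted baseline
flat = [(0, 160), (400, 160)]                # an adjacent, level baseline

print(mean_baseline_height([tilted]))        # 110
print(mean_baseline_height([tilted, flat]))  # 130
```

Reducing a slanted baseline to a single mean height makes it comparable with lines that share no x coordinates with it.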
- pagexml.analysis.layout_stats.categorise_line_width(line: PageXMLTextLine, boundary_points: List[int]) -> str
Categorise a line based on its width and a list of line width boundary points.
- pagexml.analysis.layout_stats.compute_baseline_distances(line1: PageXMLTextLine | List[PageXMLTextLine], line2: PageXMLTextLine | List[PageXMLTextLine], step: int = 50) -> ndarray
Compute the vertical distances between two baselines over their horizontal overlap, using a fixed step size. Points are interpolated on both baselines at fixed increments of step, so that each baseline has points at the same x coordinates from which the distances are calculated.
If the two lines have no horizontal overlap, the result contains a single distance: the difference between the average heights of the two baselines.
- Parameters:
line1 (PageXMLTextLine | List[PageXMLTextLine]) – the first line (or list of adjacent lines) in the comparison
line2 (PageXMLTextLine | List[PageXMLTextLine]) – the second line (or list of adjacent lines) in the comparison
step (int) – the step size in pixels for interpolation
- Returns:
an array of vertical distances based on horizontal overlap
- Return type:
ndarray
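The procedure described above can be sketched in plain Python, with baselines given as lists of (x, y) tuples ordered from left to right. The helper names `interpolate` and `baseline_distances`, the use of absolute distances, and the float results are assumptions made for this sketch, not the library's implementation.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]


def interpolate(points: List[Point], step: int = 50) -> Dict[int, float]:
    """Map x coordinates at multiples of `step` to linearly interpolated y values."""
    ys: Dict[int, float] = {}
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        if x2 == x1:
            continue  # skip vertical segments
        first = -(-x1 // step) * step  # first multiple of step at or after x1
        for x in range(first, x2 + 1, step):
            ys[x] = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    return ys


def baseline_distances(b1: List[Point], b2: List[Point], step: int = 50) -> List[float]:
    """Vertical distances at the x coordinates that both baselines cover."""
    ys1, ys2 = interpolate(b1, step), interpolate(b2, step)
    shared = sorted(set(ys1) & set(ys2))
    if not shared:
        # No horizontal overlap: fall back to the difference in mean heights.
        mean1 = sum(y for _, y in b1) / len(b1)
        mean2 = sum(y for _, y in b2) / len(b2)
        return [abs(mean1 - mean2)]
    return [abs(ys1[x] - ys2[x]) for x in shared]


upper = [(0, 100), (200, 110)]
lower = [(100, 160), (300, 160)]
far_right = [(500, 200), (700, 200)]

print(baseline_distances(upper, lower))      # [55.0, 52.5, 50.0]
print(baseline_distances(upper, far_right))  # [95.0]
```

Snapping the interpolated points to multiples of the step size guarantees that overlapping baselines share x coordinates, even when their original points do not line up.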
- pagexml.analysis.layout_stats.compute_bounding_box_distances(line1: PageXMLTextLine | List[PageXMLTextLine], line2: PageXMLTextLine | List[PageXMLTextLine], step: int = 50)
- pagexml.analysis.layout_stats.compute_columns_stats(columns: List[PageXMLColumn], stats: Dict[str, Dict[str, Counter]])
- pagexml.analysis.layout_stats.compute_lines_stats(lines: List[PageXMLTextLine], stats: Dict[str, Dict[str, Counter]]) -> None
- pagexml.analysis.layout_stats.compute_pages_stats(pages: List[PageXMLPage], stats: Dict[str, Dict[str, Counter]])
- pagexml.analysis.layout_stats.compute_pagexml_stats(docs: List[PageXMLDoc]) -> Dict[str, Dict[str, Counter]]
Compute statistics on the numbers of PageXML elements that are part of a given list of PageXMLDoc objects.
- Parameters:
docs (List[PageXMLDoc]) – a list of PageXMLDoc objects
- Returns:
A nested dictionary of statistics per PageXML element type
- Return type:
Dict[str, Dict[str, Counter]]
- pagexml.analysis.layout_stats.compute_points_distances(points1: List[Tuple[int, int]], points2: List[Tuple[int, int]], step: int = 50)
- pagexml.analysis.layout_stats.compute_scans_stats(scans: List[PageXMLScan], stats: Dict[str, Dict[str, Counter]])
- pagexml.analysis.layout_stats.compute_textregion_distance(tr1: PageXMLTextRegion, tr2: PageXMLTextRegion) -> int | float
- pagexml.analysis.layout_stats.compute_textregions_stats(text_regions: List[PageXMLTextRegion], stats: Dict[str, Dict[str, Counter]]) -> None
- pagexml.analysis.layout_stats.find_line_width_boundary_points(line_widths: List[int], line_bin_size: int = 50, min_ratio: float = 0.25) -> List[int]
Find the minima in the distribution of line widths relative to the peaks in the distribution. These minima represent the boundaries between clusters of lines within the same line width intervals.
- Parameters:
line_widths (List[int]) – a list of PageXML text line widths
line_bin_size (int) – the bin size for grouping lines to establish the line width distribution (default 50 pixels)
min_ratio (float) – the minimum ratio between a peak frequency and its neighbouring minimum to determine if the minimum is a category boundary
- Returns:
A list of category boundary points
- Return type:
List[int]
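One possible reading of this procedure, sketched in plain Python: bin the widths into a histogram, then keep each valley whose frequency is at most `min_ratio` times the lower of the highest peaks on either side. The helper name `width_boundaries`, the exact valley test, and reporting a boundary as the start of its bin are all assumptions; the library's criterion may differ.

```python
from collections import Counter
from typing import List


def width_boundaries(line_widths: List[int], bin_size: int = 50,
                     min_ratio: float = 0.25) -> List[int]:
    """Return bin starts of histogram valleys that are low relative to surrounding peaks."""
    if not line_widths:
        return []
    # Histogram of widths, keyed by the start of each bin.
    bins = Counter(w // bin_size * bin_size for w in line_widths)
    xs = list(range(min(bins), max(bins) + bin_size, bin_size))
    freqs = [bins.get(x, 0) for x in xs]
    boundaries = []
    for i in range(1, len(xs) - 1):
        left_peak = max(freqs[:i])
        right_peak = max(freqs[i + 1:])
        # A valley ends where the frequency starts rising again, so a flat
        # stretch of empty bins yields a single boundary (its last bin).
        is_valley = freqs[i] <= freqs[i - 1] and freqs[i] < freqs[i + 1]
        if is_valley and freqs[i] <= min_ratio * min(left_peak, right_peak):
            boundaries.append(xs[i])
    return boundaries


# Two clusters of line widths: around 200-300 pixels and around 800-900 pixels.
widths = [210, 220, 230, 240, 260, 270, 810, 820, 830, 840, 860]
print(width_boundaries(widths))  # [750]
```

With these widths the histogram has peaks in the 200 and 800 bins and an empty stretch in between, so the sketch reports a single boundary separating short lines from long ones.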
- pagexml.analysis.layout_stats.find_lowest_point(line: PageXMLTextLine) -> Tuple[int, int]
Find the first baseline point that corresponds to the lowest vertical point.
- Parameters:
line (PageXMLTextLine) – a PageXML TextLine object with baseline information
- Returns:
the leftmost point that has the lowest vertical coordinate
- Return type:
Tuple[int, int]
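A small sketch of this lookup on a plain list of baseline points. It assumes image coordinates, where y increases downward, so the visually lowest point has the largest y value, and it assumes the points are ordered from left to right. The helper name `lowest_point` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def lowest_point(points: List[Point]) -> Point:
    """Return the first point with the largest y value (lowest on the page)."""
    # Assumption: y grows downward, as in image coordinates.
    lowest_y = max(y for _, y in points)
    return next(p for p in points if p[1] == lowest_y)


baseline = [(0, 100), (50, 120), (100, 120), (150, 90)]
print(lowest_point(baseline))  # (50, 120)
```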
- pagexml.analysis.layout_stats.get_baseline_y(line: PageXMLTextLine) -> List[int]
Return the Y/vertical coordinates of a text line’s baseline.
- pagexml.analysis.layout_stats.get_bottom_points(line: PageXMLTextLine) -> List[Tuple[int, int]]
- pagexml.analysis.layout_stats.get_boundary_width_ranges(boundary_points: List[int]) -> List[str]
- pagexml.analysis.layout_stats.get_line_distances(lines: List[PageXMLTextLine]) -> List[ndarray]
- pagexml.analysis.layout_stats.get_line_height_stats(line: PageXMLTextLine, step: int = 50, ignore_errors: bool = False, debug: int = 0) -> Dict[str, int] | None
- pagexml.analysis.layout_stats.get_line_width_stats(lines: List[PageXMLTextLine], boundary_points: List[int]) -> Counter
Return a Counter object with statistics of the number of lines categorised according to a list of category break points (line widths that are the boundary between categories of line width).
- Parameters:
lines (List[PageXMLTextLine]) – A list of PageXML text lines
boundary_points (List[int]) – A list of line width category boundary points
- Returns:
A counter with the number of lines per line width interval
- Return type:
Counter
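The counting step can be sketched with plain integer widths in place of PageXMLTextLine objects. The helper names, the range labels such as `0-200` and `600+`, and the choice to place a width equal to a boundary in the upper interval are assumptions for illustration; the library's labels and tie handling may differ.

```python
from bisect import bisect_right
from collections import Counter
from typing import List


def width_category(width: int, boundary_points: List[int]) -> str:
    """Return a (hypothetical) range label for the interval that `width` falls in."""
    bounds = sorted(boundary_points)
    idx = bisect_right(bounds, width)  # number of boundaries at or below width
    lower = bounds[idx - 1] if idx > 0 else 0
    if idx == len(bounds):
        return f"{lower}+"  # open-ended last interval
    return f"{lower}-{bounds[idx]}"


def count_width_categories(widths: List[int], boundary_points: List[int]) -> Counter:
    """Count how many widths fall into each interval."""
    return Counter(width_category(w, boundary_points) for w in widths)


widths = [120, 180, 200, 450, 750]
stats = count_width_categories(widths, [200, 600])
print(dict(stats))  # {'0-200': 2, '200-600': 2, '600+': 1}
```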
- pagexml.analysis.layout_stats.get_line_widths(pagexml_files: List[str | PageXMLTextRegion] = None, line_width_bin_size: int = 50) -> List[int]
Return a list of line widths for the lines in a list of PageXML files.
- Parameters:
pagexml_files (List[str]) – a list of PageXML filepaths
line_width_bin_size (int) – the bin size for grouping lines (default is 50 pixels)
- Returns:
a list of line widths
- Return type:
List[int]
- pagexml.analysis.layout_stats.get_text_heights(line: PageXMLTextLine, step: int = 50, ignore_errors: bool = True, debug: int = 0) -> array
- pagexml.analysis.layout_stats.get_textregion_avg_char_width(text_region: PageXMLTextRegion) -> float
Return the estimated average (mean) character width, determined as the sum of the width of text lines divided by the sum of the number of characters of all text lines.
- Parameters:
text_region (PageXMLTextRegion) – a TextRegion object that contains TextLines
- Returns:
the average (mean) character width
- Return type:
float
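The ratio described above, sketched on (pixel width, text) pairs instead of TextLine objects. The helper name `avg_char_width` and the 0.0 result for a region without text are assumptions for illustration.

```python
from typing import List, Tuple


def avg_char_width(lines: List[Tuple[int, str]]) -> float:
    """Total line width divided by total character count (hypothetical helper)."""
    total_width = sum(width for width, _ in lines)
    total_chars = sum(len(text) for _, text in lines)
    if total_chars == 0:
        return 0.0  # assumption: regions without text yield 0.0
    return total_width / total_chars


lines = [(600, "a" * 20), (400, "b" * 30)]
print(avg_char_width(lines))  # 20.0
```

Dividing the two sums weights every character equally. Averaging the per-line ratios instead (30.0 and about 13.3 pixels per character here) would give about 21.7, letting short lines count as much as long ones.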
- pagexml.analysis.layout_stats.get_textregion_avg_line_distance(text_region: PageXMLTextRegion, avg_type: str = 'macro') -> float
Returns the median distance between subsequent lines in a textregion object. If the textregion contains smaller textregions, it only considers line distances between lines within the same column (i.e. only lines from textregions that are horizontally aligned).
By default, the macro-average is returned.
- Parameters:
text_region (PageXMLTextRegion) – a TextRegion object that contains TextLines
avg_type (str) – the type of averaging to apply (macro or micro)
- Returns:
the median distance between horizontally aligned lines
- Return type:
float
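The difference between the two averaging types can be sketched as follows. This assumes that micro-averaging pools every sampled distance before taking the median, while macro-averaging first reduces each pair of lines to its mean distance and then takes the median of those means. That interpretation, and the helper name `summarise_distances`, are assumptions rather than confirmed library behaviour.

```python
from statistics import mean, median
from typing import List


def summarise_distances(per_pair: List[List[float]], avg_type: str = "macro") -> float:
    """Summarise the sampled distances of several line pairs into one value."""
    per_pair = [distances for distances in per_pair if distances]
    if not per_pair:
        return 0.0  # assumption: no measurable distances yields 0.0
    if avg_type == "micro":
        # Pool all samples: pairs with more samples carry more weight.
        return float(median(d for distances in per_pair for d in distances))
    if avg_type == "macro":
        # One value per line pair: every pair carries the same weight.
        return float(median(mean(distances) for distances in per_pair))
    raise ValueError(f"unknown avg_type: {avg_type!r}")


samples = [[50, 52, 54], [60, 60], [100]]
print(summarise_distances(samples, "macro"))  # 60.0
print(summarise_distances(samples, "micro"))  # 57.0
```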
- pagexml.analysis.layout_stats.get_textregion_avg_line_width(text_region: PageXMLTextRegion, unit: str = 'char') -> float
Return the estimated average (mean) line width of the text lines in a text region, measured in the given unit (characters or pixels).
- Parameters:
text_region (PageXMLTextRegion) – a TextRegion object that contains TextLines
unit (str) – the unit to measure line width, either char or pixel
- Returns:
the average (mean) line width
- Return type:
float
- pagexml.analysis.layout_stats.get_textregion_line_distances(text_region: PageXMLTextRegion) -> List[ndarray]
Returns a list of line distance numpy arrays. For each line, its distance to the next at 50 pixel intervals is computed and stored in a numpy ndarray.
- Parameters:
text_region (PageXMLTextRegion) – a TextRegion object that contains TextLines
- Returns:
a list of numpy ndarrays of line distances
- Return type:
List[np.ndarray]
- pagexml.analysis.layout_stats.interpolate_baseline_points(points: List[Tuple[int, int]], step: int = 50) -> Dict[int, int]
Determine the x coordinates between each pair of subsequent points on a baseline and calculate their corresponding y coordinates.
- Parameters:
points (List[Tuple[int, int]]) – the list of points of a baseline object
step (int) – the step size in pixels for interpolation
- Returns:
a dictionary of interpolated points based on step size
- Return type:
Dict[int, int]
- pagexml.analysis.layout_stats.interpolate_points(p1: Tuple[int, int], p2: Tuple[int, int], step: int = 50) -> Generator[Dict[int, int], None, None]
Determine the x coordinates between a pair of points on a baseline and calculate their corresponding y coordinates.
- Parameters:
p1 (Tuple[int, int]) – a 2D point
p2 (Tuple[int, int]) – a 2D point
step (int) – the step size in pixels for interpolation
- Returns:
a generator of interpolated points based on step size
- Return type:
Generator[Dict[int, int], None, None]
- pagexml.analysis.layout_stats.line_starts_with_big_capital(line: PageXMLTextLine) -> bool
Determine if a line starts with a capital in a larger font than the rest of the line. Such a capital is aligned with the line at the top, so it sticks out at the bottom.
- pagexml.analysis.layout_stats.sort_coords_above_below_baseline(line: PageXMLTextLine, debug: int = 0) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]
Split the bounding polygon coordinates of a line into two lists: the points above the baseline and the points below it. When a line has no baseline or no bounding polygon, empty lists are returned.
- Parameters:
line (PageXMLTextLine) – a PageXML text line
debug (int) – the detail level of debug information (0 = none, higher is more)
- Returns:
two lists of bounding polygon points
- Return type:
tuple
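One way to perform such a split, sketched on plain point lists. Each polygon point is compared with the baseline height at the same x coordinate, obtained by linear interpolation. The helper names, the image-coordinate convention (smaller y is higher on the page), and the choice to count points exactly on the baseline as below it are assumptions for this sketch.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def baseline_y_at(x: int, baseline: List[Point]) -> float:
    """Baseline height at x, linearly interpolated and clamped at both ends."""
    pts = sorted(baseline)
    if x <= pts[0][0]:
        return float(pts[0][1])
    if x >= pts[-1][0]:
        return float(pts[-1][1])
    for (x1, y1), (x2, y2) in zip(pts, pts[1:]):
        if x1 <= x <= x2:
            if x2 == x1:
                return float(y1)
            return y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    return float(pts[-1][1])


def split_polygon(polygon: List[Point],
                  baseline: List[Point]) -> Tuple[List[Point], List[Point]]:
    """Split polygon points into those above and those below the baseline."""
    if not polygon or not baseline:
        return [], []
    above: List[Point] = []
    below: List[Point] = []
    for x, y in polygon:
        # Assumption: y grows downward, so a smaller y is above the baseline.
        if y < baseline_y_at(x, baseline):
            above.append((x, y))
        else:
            below.append((x, y))
    return above, below


baseline = [(0, 100), (400, 100)]
polygon = [(0, 60), (400, 60), (400, 130), (0, 130)]
above, below = split_polygon(polygon, baseline)
print(above)  # [(0, 60), (400, 60)]
print(below)  # [(400, 130), (0, 130)]
```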