pagexml.analysis.layout_stats module

pagexml.analysis.layout_stats.average_baseline_height(line: PageXMLTextLine | List[PageXMLTextLine]) → int

Compute the average (mean) baseline height for comparing lines that are not horizontally aligned.

Parameters:

line (PageXMLTextLine | List[PageXMLTextLine]) – a TextLine or a list of adjacent lines

Returns:

the average (mean) baseline height across all its baseline points

Return type:

int
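
The computation can be sketched in plain Python, without the pagexml classes. This is only an illustration under assumptions: baselines are represented as lists of (x, y) tuples, the hypothetical helper mean_baseline_height is not part of the library, and truncating the mean to an int is a guess based on the documented return type.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def mean_baseline_height(baselines: List[List[Point]]) -> int:
    """Mean y coordinate over all baseline points of one or more lines.

    Hypothetical helper: each baseline is a plain list of (x, y) points.
    """
    ys = [y for points in baselines for _, y in points]
    if not ys:
        raise ValueError("no baseline points given")
    # Truncation to int is an assumption, chosen to match the int return type.
    return int(sum(ys) / len(ys))


line_a = [(100, 200), (300, 204), (500, 202)]
line_b = [(520, 210), (700, 212)]

single = mean_baseline_height([line_a])            # one line
combined = mean_baseline_height([line_a, line_b])  # adjacent lines pooled
```

Pooling the points of adjacent lines gives a single height that can be compared with another line even when the two have no x coordinates in common.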

pagexml.analysis.layout_stats.categorise_line_width(line: PageXMLTextLine, boundary_points: List[int]) → str

Categorise a line based on its width and a list of line width boundary points.

pagexml.analysis.layout_stats.compute_baseline_distances(line1: PageXMLTextLine | List[PageXMLTextLine], line2: PageXMLTextLine | List[PageXMLTextLine], step: int = 50) → ndarray

Compute the vertical distance between two baselines, based on their horizontal overlap, using a fixed step size. Interpolated points will be generated at fixed increments of step size for both baselines, so they have points with corresponding x coordinates to calculate the distance.

If the two lines have no horizontal overlap, the function returns a list with a single distance: the difference between the average heights of the two baselines.

Parameters:
  • line1 (PageXMLTextLine | List[PageXMLTextLine]) – the first line (or list of adjacent lines) in the comparison

  • line2 (PageXMLTextLine | List[PageXMLTextLine]) – the second line (or list of adjacent lines) in the comparison

  • step (int) – the step size in pixels for interpolation

Returns:

a list of vertical distances based on horizontal overlap

Return type:

List[int]
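
A rough sketch of the distance step, assuming both baselines have already been interpolated into {x: y} dictionaries on the same grid. The helper name vertical_distances, the use of absolute differences, and the plain list return value are illustrative assumptions, not the library's implementation.

```python
from typing import Dict, List


def vertical_distances(base1: Dict[int, int], base2: Dict[int, int]) -> List[int]:
    """Vertical distances at the x coordinates that both baselines share.

    Hypothetical helper: base1 and base2 map interpolated x to y.
    """
    if not base1 or not base2:
        raise ValueError("both baselines need at least one point")
    shared_x = sorted(set(base1) & set(base2))
    if shared_x:
        # Horizontal overlap: one distance per shared x coordinate.
        return [abs(base2[x] - base1[x]) for x in shared_x]
    # No overlap: fall back to the difference between the mean heights.
    mean1 = sum(base1.values()) / len(base1)
    mean2 = sum(base2.values()) / len(base2)
    return [abs(int(mean2 - mean1))]


upper = {100: 200, 150: 202, 200: 204}
lower = {150: 262, 200: 266, 250: 268}
aside = {500: 300, 550: 300}

overlapping = vertical_distances(upper, lower)
disjoint = vertical_distances(upper, aside)
```

Only x = 150 and x = 200 occur in both upper and lower, so two distances are produced; upper and aside share no x coordinate, so a single mean-height difference is returned.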

pagexml.analysis.layout_stats.compute_bounding_box_distances(line1: PageXMLTextLine | List[PageXMLTextLine], line2: PageXMLTextLine | List[PageXMLTextLine], step: int = 50)
pagexml.analysis.layout_stats.compute_columns_stats(columns: List[PageXMLColumn], stats: Dict[str, Dict[str, Counter]])
pagexml.analysis.layout_stats.compute_height_stats(line_heights: array) → Dict[str, int]
pagexml.analysis.layout_stats.compute_lines_stats(lines: List[PageXMLTextLine], stats: Dict[str, Dict[str, Counter]]) → None
pagexml.analysis.layout_stats.compute_pages_stats(pages: List[PageXMLPage], stats: Dict[str, Dict[str, Counter]])
pagexml.analysis.layout_stats.compute_pagexml_stats(docs: List[PageXMLDoc]) → Dict[str, Dict[str, Counter]]

Compute statistics on the numbers of PageXML elements that are part of a given list of PageXMLDoc objects.

Parameters:

docs (List[PageXMLDoc]) – a list of PageXMLDoc objects

Returns:

A nested dictionary of statistics per PageXML element type

Return type:

Dict[str, Dict[str, Counter]]

pagexml.analysis.layout_stats.compute_points_distances(points1: List[Tuple[int, int]], points2: List[Tuple[int, int]], step: int = 50)
pagexml.analysis.layout_stats.compute_scans_stats(scans: List[PageXMLScan], stats: Dict[str, Dict[str, Counter]])
pagexml.analysis.layout_stats.compute_textregion_distance(tr1: PageXMLTextRegion, tr2: PageXMLTextRegion) → int | float
pagexml.analysis.layout_stats.compute_textregions_stats(text_regions: List[PageXMLTextRegion], stats: Dict[str, Dict[str, Counter]]) → None
pagexml.analysis.layout_stats.find_line_width_boundary_points(line_widths: List[int], line_bin_size: int = 50, min_ratio: float = 0.25) → List[int]

Find the minima in the distribution of line widths relative to the peaks in the distribution. These minima represent the boundaries between clusters of lines within the same line width intervals.

Parameters:
  • line_widths (List[int]) – a list of PageXML text line widths

  • line_bin_size (int) – the bin size for grouping lines to establish the line width distribution (default 50 pixels)

  • min_ratio (float) – the minimum ratio between a peak frequency and its neighbouring minimum to determine if the minimum is a category boundary

Returns:

A list of category boundary points

Return type:

List[int]
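
One possible reading of this procedure is sketched below. It is not the library's algorithm: the function name boundary_points_sketch, the rule that a bin counts as a boundary when it is a local minimum whose frequency is at most min_ratio times the lower of the highest bins on either side, and the choice to report the lower edge of the bin are all assumptions made for illustration.

```python
from collections import Counter
from typing import List


def boundary_points_sketch(line_widths: List[int], bin_size: int = 50,
                           min_ratio: float = 0.25) -> List[int]:
    """Find sparse bins that separate clusters of line widths (illustrative)."""
    if not line_widths:
        return []
    # Histogram of widths, grouped in bins of bin_size pixels.
    counts = Counter(width // bin_size for width in line_widths)
    bins = range(min(counts), max(counts) + 1)
    freqs = [counts.get(b, 0) for b in bins]
    boundaries = []
    for i in range(1, len(freqs) - 1):
        is_local_min = freqs[i] <= freqs[i - 1] and freqs[i] < freqs[i + 1]
        # Compare the minimum with the lower of the two surrounding peaks.
        lower_peak = min(max(freqs[:i]), max(freqs[i + 1:]))
        if is_local_min and freqs[i] <= min_ratio * lower_peak:
            boundaries.append(bins[i] * bin_size)
    return boundaries


# Two clusters of widths (around 310 and 510 pixels) with a sparse gap.
widths = [310] * 10 + [360] * 2 + [410] * 1 + [460] * 3 + [510] * 9
two_clusters = boundary_points_sketch(widths)

# A nearly flat distribution has no minimum that is deep enough.
flat = boundary_points_sketch([310] * 5 + [360] * 4 + [410] * 5)
```

With the default min_ratio of 0.25, a dip only counts as a boundary when it is clearly deeper than the clusters around it, which keeps small fluctuations in the histogram from splitting a single cluster.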

pagexml.analysis.layout_stats.find_lowest_point(line: PageXMLTextLine) → Tuple[int, int]

Find the first baseline point that corresponds to the lowest vertical point.

Parameters:

line (PageXMLTextLine) – a PageXML TextLine object with baseline information

Returns:

the leftmost point that has the lowest vertical coordinate

Return type:

Tuple[int, int]
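
A minimal sketch of the lookup, under two assumptions that the text above does not confirm: image coordinates are used, so the visually lowest point has the largest y value, and baseline points are ordered left to right, so the first match is also the leftmost. The helper lowest_baseline_point is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def lowest_baseline_point(points: List[Point]) -> Point:
    """Return the first point that lies at the bottom of the baseline.

    Assumes y grows downward (image coordinates), so bottom = max y.
    """
    if not points:
        raise ValueError("baseline has no points")
    bottom = max(y for _, y in points)
    for point in points:
        if point[1] == bottom:
            return point
    raise AssertionError("unreachable: bottom is taken from points")


baseline = [(100, 200), (200, 206), (300, 206), (400, 203)]
lowest = lowest_baseline_point(baseline)
```

Two points share the bottom value 206 here; the scan returns the one that comes first in the list.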

pagexml.analysis.layout_stats.get_baseline_y(line: PageXMLTextLine) → List[int]

Return the Y/vertical coordinates of a text line’s baseline.

pagexml.analysis.layout_stats.get_bottom_points(line: PageXMLTextLine) → List[Tuple[int, int]]
pagexml.analysis.layout_stats.get_boundary_width_ranges(boundary_points: List[int]) → List[str]
pagexml.analysis.layout_stats.get_line_distances(lines: List[PageXMLTextLine]) → List[ndarray]
pagexml.analysis.layout_stats.get_line_height_stats(line: PageXMLTextLine, step: int = 50, ignore_errors: bool = False, debug: int = 0) → Dict[str, int] | None
pagexml.analysis.layout_stats.get_line_width_stats(lines: List[PageXMLTextLine], boundary_points: List[int]) → Counter

Return a Counter object with statistics of the number of lines categorised according to a list of category break points (line widths that are the boundary between categories of line width).

Parameters:
  • lines (List[PageXMLTextLine]) – A list of PageXML text lines

  • boundary_points (List[int]) – A list of line width category boundary points

Returns:

A counter with the number of lines per line width interval

Return type:

Counter
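
The counting can be illustrated with plain integers standing in for text lines. The label format ("0-400", "800+") and both helper names are hypothetical; the strings the library actually produces are not documented here. The sketch assumes boundary_points is sorted in ascending order.

```python
from bisect import bisect_right
from collections import Counter
from typing import List


def width_category(width: int, boundary_points: List[int]) -> str:
    """Label the interval between boundary points that contains width."""
    idx = bisect_right(boundary_points, width)
    lower = boundary_points[idx - 1] if idx > 0 else 0
    if idx == len(boundary_points):
        return f"{lower}+"  # open-ended last interval
    return f"{lower}-{boundary_points[idx]}"


def count_width_categories(widths: List[int], boundary_points: List[int]) -> Counter:
    """Count how many widths fall into each interval."""
    return Counter(width_category(w, boundary_points) for w in widths)


widths = [120, 380, 410, 790, 820, 1500]
stats = count_width_categories(widths, [400, 800])
```

Because bisect_right is used, a width equal to a boundary point falls into the interval that starts at that point.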

pagexml.analysis.layout_stats.get_line_widths(pagexml_files: List[str | PageXMLTextRegion] = None, line_width_bin_size: int = 50) → List[int]

Return a list of line widths for the lines in a list of PageXML files.

Parameters:
  • pagexml_files (List[str]) – a list of PageXML filepaths

  • line_width_bin_size (int) – the bin size for grouping lines (default is 50 pixels)

Returns:

a list of line widths

Return type:

List[int]

pagexml.analysis.layout_stats.get_text_heights(line: PageXMLTextLine, step: int = 50, ignore_errors: bool = True, debug: int = 0) → array
pagexml.analysis.layout_stats.get_textregion_avg_char_width(text_region: PageXMLTextRegion) → float

Return the estimated average (mean) character width, determined as the sum of the width of text lines divided by the sum of the number of characters of all text lines.

Parameters:

text_region (PageXMLTextRegion) – a TextRegion object that contains TextLines

Returns:

the average (mean) character width

Return type:

float
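
The ratio described above, as a small sketch. Lines are represented as hypothetical (pixel width, text) pairs instead of PageXMLTextLine objects, and returning 0.0 for a region without characters is an assumption.

```python
from typing import List, Tuple


def mean_char_width(lines: List[Tuple[int, str]]) -> float:
    """Total line width divided by total number of characters."""
    total_width = sum(width for width, _ in lines)
    total_chars = sum(len(text) for _, text in lines)
    if total_chars == 0:
        return 0.0  # assumption: empty regions yield 0.0 rather than an error
    return total_width / total_chars


lines = [(600, "a" * 30), (400, "b" * 20), (250, "c" * 10)]
avg_char_width = mean_char_width(lines)  # 1250 pixels / 60 characters
```

Dividing the sums weights every character equally. Averaging the per-line ratios (20, 20 and 25 pixels here) would instead give short lines the same weight as long ones.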

pagexml.analysis.layout_stats.get_textregion_avg_line_distance(text_region: PageXMLTextRegion, avg_type: str = 'macro') → float

Return the median distance between consecutive lines in a TextRegion object. If the text region contains nested text regions, only distances between lines within the same column are considered (i.e. only lines from text regions that are horizontally aligned).

By default, the macro-average is returned.

Parameters:
  • text_region (PageXMLTextRegion) – a TextRegion object that contains TextLines

  • avg_type (str) – the type of averaging to apply (macro or micro)

Returns:

the median distance between horizontally aligned lines

Return type:

float
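
The difference between the two avg_type settings can be illustrated in general terms. The sketch uses means over hypothetical lists of per-pair distances; how the library combines this with the median mentioned above is not stated here, so the code is not a description of its behaviour.

```python
from typing import List


def micro_average(distance_lists: List[List[float]]) -> float:
    """Pool every individual distance, then average once."""
    pooled = [d for distances in distance_lists for d in distances]
    return sum(pooled) / len(pooled)


def macro_average(distance_lists: List[List[float]]) -> float:
    """Average each line pair first, then average those per-pair values."""
    per_pair = [sum(distances) / len(distances) for distances in distance_lists]
    return sum(per_pair) / len(per_pair)


# Distances for two line pairs: a long pair with four sample points
# and a short pair with a single sample point.
distance_lists = [[60, 62, 64, 62], [90]]

micro = micro_average(distance_lists)
macro = macro_average(distance_lists)
```

Micro-averaging lets long line pairs, which contribute more sample points, dominate the result; macro-averaging gives every line pair equal weight.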

pagexml.analysis.layout_stats.get_textregion_avg_line_width(text_region: PageXMLTextRegion, unit: str = 'char') → float

Return the average (mean) line width of the text lines in a text region, measured in the given unit (characters or pixels).

Parameters:
  • text_region (PageXMLTextRegion) – a TextRegion object that contains TextLines

  • unit (str) – the unit to measure line width, either char or pixel

Returns:

the average (mean) line width

Return type:

float

pagexml.analysis.layout_stats.get_textregion_line_distances(text_region: PageXMLTextRegion) → List[ndarray]

Return a list of NumPy arrays of line distances. For each line, the distance to the next line is computed at 50-pixel intervals and stored in a NumPy ndarray.

Parameters:

text_region (PageXMLTextRegion) – a TextRegion object that contains TextLines

Returns:

a list of numpy ndarrays of line distances

Return type:

List[np.ndarray]

pagexml.analysis.layout_stats.interpolate_baseline_points(points: List[Tuple[int, int]], step: int = 50) → Dict[int, int]

Determine the x coordinates between each pair of subsequent points on a baseline and calculate their corresponding y coordinates.

Parameters:
  • points (List[Tuple[int, int]]) – the list of points of a baseline object

  • step (int) – the step size in pixels for interpolation

Returns:

a dictionary of interpolated points based on step size

Return type:

Dict[int, int]
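
A minimal sketch of this kind of interpolation. It assumes that the interpolated x coordinates are the multiples of step that fall within each segment (so that two different baselines end up with matching x values) and that y values are rounded to the nearest integer. Both helper names are hypothetical, and the pair helper yields (x, y) tuples for simplicity.

```python
from typing import Dict, Iterator, List, Tuple

Point = Tuple[int, int]


def interpolate_pair(p1: Point, p2: Point, step: int = 50) -> Iterator[Point]:
    """Yield (x, y) on the segment p1-p2 at x values that are multiples of step."""
    if p1[0] > p2[0]:
        p1, p2 = p2, p1  # make sure the segment runs left to right
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        return  # vertical segment: nothing to interpolate
    first_x = -(-x1 // step) * step  # round x1 up to a multiple of step
    slope = (y2 - y1) / (x2 - x1)
    for x in range(first_x, x2 + 1, step):
        yield x, round(y1 + (x - x1) * slope)


def interpolate_baseline(points: List[Point], step: int = 50) -> Dict[int, int]:
    """Interpolate every pair of consecutive baseline points into {x: y}."""
    interpolated: Dict[int, int] = {}
    for p1, p2 in zip(points, points[1:]):
        interpolated.update(interpolate_pair(p1, p2, step))
    return interpolated


baseline = [(120, 200), (260, 214), (410, 214)]
grid = interpolate_baseline(baseline, step=50)
```

Snapping to multiples of step rather than starting at each baseline's first point is what lets two baselines with different start positions be compared at identical x coordinates.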

pagexml.analysis.layout_stats.interpolate_points(p1: Tuple[int, int], p2: Tuple[int, int], step: int = 50) → Generator[Dict[int, int], None, None]

Determine the x coordinates between a pair of points on a baseline and calculate their corresponding y coordinates.

Parameters:
  • p1 (Tuple[int, int]) – a 2D point

  • p2 (Tuple[int, int]) – a 2D point

  • step (int) – the step size in pixels for interpolation

Returns:

a generator of interpolated points based on step size

Return type:

Generator[Dict[int, int], None, None]

pagexml.analysis.layout_stats.line_starts_with_big_capital(line: PageXMLTextLine) → bool

Determine whether a line starts with a capital in a larger font than the rest of the line. Such a capital is aligned with the line at the top, so it sticks out at the bottom.

pagexml.analysis.layout_stats.sort_coords_above_below_baseline(line: PageXMLTextLine, debug: int = 0) → Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]

Split the list of bounding polygon coordinates of a line into sets of points above and below the baseline. When a line has no baseline or no bounding polygon, empty lists are returned.

Parameters:
  • line (PageXMLTextLine) – a PageXML text line

  • debug (int) – the detail level of debug information (0 = none, higher is more)

Returns:

two lists of bounding polygon points

Return type:

tuple
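
The split can be sketched by comparing each polygon point with the baseline height at the same x coordinate. The sketch makes several assumptions: y grows downward (image coordinates), the baseline is sorted left to right, points beyond the ends of the baseline are compared with the nearest end point, and points exactly on the baseline count as below. Both helper names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def baseline_y_at(x: int, baseline: List[Point]) -> float:
    """Baseline height at x, by linear interpolation between baseline points."""
    if x <= baseline[0][0]:
        return baseline[0][1]
    if x >= baseline[-1][0]:
        return baseline[-1][1]
    for (x1, y1), (x2, y2) in zip(baseline, baseline[1:]):
        if x1 <= x <= x2:
            if x1 == x2:
                return y1
            return y1 + (x - x1) * (y2 - y1) / (x2 - x1)
    return baseline[-1][1]  # only reached if the baseline is not sorted


def split_above_below(polygon: List[Point],
                      baseline: List[Point]) -> Tuple[List[Point], List[Point]]:
    """Split polygon points into those above and those below the baseline."""
    if not polygon or not baseline:
        return [], []
    above, below = [], []
    for x, y in polygon:
        # Smaller y means higher on the page in image coordinates.
        if y < baseline_y_at(x, baseline):
            above.append((x, y))
        else:
            below.append((x, y))
    return above, below


baseline = [(100, 200), (500, 210)]
polygon = [(100, 160), (300, 204), (500, 170), (500, 230), (100, 220)]
above, below = split_above_below(polygon, baseline)
```

At x = 300 the interpolated baseline height is 205, so the point (300, 204) lands just above the baseline.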