pagexml.column_parser module

pagexml.column_parser.column_bounding_box_surrounds_lines(column: PageXMLColumn) → bool[source]: Check if the column coordinates contain the coordinate boxes of the column lines.

pagexml.column_parser.compute_pixel_dist(lines: List[PageXMLTextLine]) → Counter[source]: Count how many lines are above each horizontal pixel coordinate.

pagexml.column_parser.determine_column_type(column: PageXMLColumn) → str[source]: Determine whether a column is a full-text column, margin column or extra text column.

pagexml.column_parser.determine_freq_gap_interval(pixel_dist: Counter, gap_threshold: int) → list[source]

pagexml.column_parser.find_column_gaps(lines: List[PageXMLTextLine], gap_threshold: int = 50)[source]

pagexml.column_parser.find_overlapping_columns(columns: List[PageXMLColumn])[source]

pagexml.column_parser.handle_extra_lines(text_region: PageXMLTextRegion, columns: List[PageXMLColumn], extra_lines: List[PageXMLTextLine], gap_threshold: int = 50, debug: bool = False)[source]

pagexml.column_parser.is_full_text_column(column: PageXMLColumn, page: PageXMLTextRegion = None, num_page_cols: int = 2) → bool[source]: Check if a page column is a full-text column (running from top to bottom of page).

pagexml.column_parser.is_header_footer_column(column: PageXMLColumn) → bool[source]: Check if a column is a header or footer.

pagexml.column_parser.is_noise_column(column: PageXMLColumn) → bool[source]: Check if columns contains only very short lines.

pagexml.column_parser.is_text_column(column: PageXMLColumn) → bool[source]: Check if there is at least one alpha-numeric word on the page.

pagexml.column_parser.make_column_range_columns(text_region: PageXMLTextRegion, column_lines: List[List[PageXMLTextLine]]) → List[PageXMLColumn][source]

pagexml.column_parser.make_derived_column(lines: List[PageXMLTextLine], metadata: dict, page_id: str) → PageXMLColumn[source]: Make a new PageXMLColumn based on a set of lines, column metadata and a page_id.

pagexml.column_parser.merge_columns(columns: List[PageXMLColumn], doc_id: str, metadata: dict) → PageXMLColumn[source]: Merge a list of columns into one. First, all text regions of all columns are checked for spatial overlap, whereby overlapping text regions are merged. Within the merged text regions, lines are sorted by baseline height.

pagexml.column_parser.merge_overlapping_columns(text_region: PageXMLTextRegion, columns: List[PageXMLColumn])[source]

pagexml.column_parser.new_gap_pixel_interval(pixel: int) → dict[source]

pagexml.column_parser.sort_lines_in_column_ranges(lines: List[PageXMLTextLine], column_ranges: List[Dict[str, int]], overlap_threshold: float, debug: bool = False) → Tuple[List[List[PageXMLTextLine]], List[PageXMLTextLine]][source]

pagexml.column_parser.split_lines_on_column_gaps(text_region: PageXMLTextRegion, gap_threshold: int = 50, overlap_threshold: float = 0.5) → List[PageXMLColumn][source]

Takes a PageXMLTextRegion object and tries to split the lines into columns based on a minimum horizontal gap (in number of pixels) between columns.

Parameters:

text_region (PageXMLTextRegion) – a text region with lines (as direct children or as deeper descendants).
gap_threshold (int) – the minimum number of horizontal pixels between columns of horizontally aligned lines to be considered a column boundary. Default is 50.
overlap_threshold (float) – the minimum overlap ratio between two lines to be considered horizontally aligned (i.e. part of the same ‘column’). Default is 0.5, that is, two lines need to horizontally overlap at least 50% of the shortest line.

pagexml.column_parser.within_column(line: PageXMLTextLine, column_range: Dict[str, int], overlap_threshold: float = 0.5)[source]: Determine if a given line is within the horizontal range of a column.