Class Document (0.14.1a0)

Document(
    shards: typing.List[google.cloud.documentai_v1.types.document.Document],
    gcs_bucket_name: typing.Optional[str] = None,
    gcs_prefix: typing.Optional[str] = None,
    gcs_uri: typing.Optional[str] = None,
    gcs_input_uri: typing.Optional[str] = None,
)

Represents a wrapped Document.

This class hides the complexity of working with the Document protobuf returned by the BatchProcessDocuments and ProcessDocument methods, and provides convenience methods for searching and extracting information from the Document.

Attributes
Name	Description
`shards :noindex:`	`List[google.cloud.documentai.Document]` Required. A list of `documentai.Document` shards of the same `Document`. Each shard consists of a number of pages in the `Document`.
`gcs_bucket_name :noindex:`	`Optional[str]` Optional. The name of the gcs bucket. Format: Given `gs://{bucket_name}/{optional_folder}/{target_folder}/` where `gcs_bucket_name={bucket_name}`.
`gcs_prefix :noindex:`	`Optional[str]` Optional. The prefix of the JSON files in the target_folder. Format: Given `gs://{bucket_name}/{optional_folder}/{target_folder}/` where `gcs_prefix={optional_folder}/{target_folder}`. For more information, refer to https://cloud.google.com/storage/docs/json_api/v1/objects/list
`gcs_input_uri :noindex:`	`Optional[str]` Optional. The gcs uri to the original input file. Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/{file_name}.pdf`
`pages :noindex:`	`List[Page]` A list of `Pages` in the `Document`.
`entities :noindex:`	`List[Entity]` A list of un-nested `Entities` in the `Document`.
`chunks :noindex:`	`Iterator[documentai.Document.ChunkedDocument.Chunk]` An iterator of document chunks extracted from a Layout Parser.
`document_layout_blocks :noindex:`	`Iterator[documentai.Document.DocumentLayout.DocumentLayoutBlock]` An iterator of document layout blocks extracted from a Layout Parser.
`text :noindex:`	`str` The full text of the `Document`.

Methods

convert_document_to_annotate_file_json_response

convert_document_to_annotate_file_json_response() -> str

Converts OCR data from Document.proto to a JSON string of AnnotateFileResponse for the Vision API.

Returns
Type	Description
`str`	JSON string of `TextAnnotations`.

convert_document_to_annotate_file_response

convert_document_to_annotate_file_response() -> (
    google.cloud.vision_v1.types.image_annotator.AnnotateFileResponse
)

Converts OCR data from Document.proto to AnnotateFileResponse.proto for the Vision API.

Returns
Type	Description
`AnnotateFileResponse`	Proto with `TextAnnotations`.

entities_to_bigquery

entities_to_bigquery(
    dataset_name: str, table_name: str, project_id: typing.Optional[str] = None
) -> google.cloud.bigquery.job.load.LoadJob

Adds extracted entities to a BigQuery table.

Parameters
Name	Description
`dataset_name`	`str` Required. Name of the BigQuery dataset.
`table_name`	`str` Required. Name of the BigQuery table.
`project_id`	`Optional[str]` Optional. Project ID containing the BigQuery table. If not passed, falls back to the default inferred from the environment.

Returns
Type	Description
`bigquery.job.LoadJob`	The BigQuery `LoadJob` for adding the entities.

entities_to_dict

entities_to_dict() -> typing.Dict[str, typing.Union[str, typing.List[str]]]

Returns a dictionary of the entities in the document.

Returns
Type	Description
`Dict[str, Union[str, List[str]]]`	The Dict of the entities indexed by type.
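The return type `Dict[str, Union[str, List[str]]]` means a value can be either a single string or a list of strings. The sketch below uses plain `(type, text)` tuples rather than real `Entity` objects, and it assumes that a type seen more than once is collected into a list, which is what the type hint suggests but this page does not state. `group_by_type` and `as_list` are hypothetical helper names.

```python
from typing import Dict, Iterable, List, Tuple, Union

EntityDict = Dict[str, Union[str, List[str]]]


def group_by_type(pairs: Iterable[Tuple[str, str]]) -> EntityDict:
    """Build a dict shaped like the documented return type (assumed behavior)."""
    result: EntityDict = {}
    for entity_type, text in pairs:
        existing = result.get(entity_type)
        if existing is None:
            # First occurrence of this type: store the bare string.
            result[entity_type] = text
        elif isinstance(existing, list):
            existing.append(text)
        else:
            # Second occurrence: promote the string to a list.
            result[entity_type] = [existing, text]
    return result


def as_list(value: Union[str, List[str]]) -> List[str]:
    """Normalize a dict value so callers can always iterate over it."""
    return value if isinstance(value, list) else [value]


grouped = group_by_type(
    [("invoice_id", "A-1"), ("line_item", "pen"), ("line_item", "ink")]
)
```

Normalizing with a helper like `as_list` keeps downstream code from having to branch on the value type.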

export_hocr_str

export_hocr_str(title: str) -> str

Exports a string hOCR version of the Document.

Object ids use the following format:
    object_{page_index}_...

For example, words have the following id format:
    word_{page_index}_{block_index}_{paragraph_index}_{line_index}_{word_index}

Parameter
Name	Description
`title`	`str` Required. The title for hocr_page and head.

Returns
Type	Description
`str`	A string hOCR version of the Document.
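The id format above can be built and parsed with a few lines of Python, which may help when locating elements in the exported hOCR. This is a sketch only: `hocr_word_id` and `parse_hocr_id` are hypothetical helpers, not part of the toolbox, and the parser assumes that the object name contains only letters and that every index is a non-negative integer.

```python
import re
from typing import List, Tuple

# Assumed shape: a letters-only object name followed by one or more indexes.
_HOCR_ID = re.compile(r"^([A-Za-z]+)_(\d+(?:_\d+)*)$")


def hocr_word_id(
    page_index: int,
    block_index: int,
    paragraph_index: int,
    line_index: int,
    word_index: int,
) -> str:
    """Format a word id following the documented pattern."""
    return (
        f"word_{page_index}_{block_index}_{paragraph_index}"
        f"_{line_index}_{word_index}"
    )


def parse_hocr_id(element_id: str) -> Tuple[str, List[int]]:
    """Split an id into its object name and its list of indexes."""
    match = _HOCR_ID.match(element_id)
    if match is None:
        raise ValueError(f"Unrecognized hOCR id: {element_id!r}")
    kind, indexes = match.groups()
    return kind, [int(part) for part in indexes.split("_")]
```

The first index returned by `parse_hocr_id` is always the page index, since every id starts with `object_{page_index}_`.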

export_images

export_images(
    output_path: str, output_file_prefix: str, output_file_extension: str
) -> typing.List[str]

Exports images from Document.entities to files. Only exports Portrait entities.

Parameters
Name	Description
`output_path`	`str` Required. The path to the output directory.
`output_file_prefix`	`str` Required. The output file name prefix.
`output_file_extension`	`str` Required. The output file extension. Format: `png`, `jpg`, etc.

Returns
Type	Description
`List[str]`	A list of output image file names. Format: `{output_path}/{output_file_prefix}_{index}_{Entity.type_}.{output_file_extension}`
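To predict where an exported image will land, the documented file name format can be reproduced directly. The helper below is a hypothetical illustration, not a toolbox function; the `Portrait` type value follows from the note that only Portrait entities are exported, and the other argument values are made up.

```python
def expected_image_name(
    output_path: str,
    output_file_prefix: str,
    index: int,
    entity_type: str,
    output_file_extension: str,
) -> str:
    """Mirror the documented output file name format."""
    return (
        f"{output_path}/{output_file_prefix}_{index}_{entity_type}"
        f".{output_file_extension}"
    )


# Example values are placeholders.
name = expected_image_name("out", "id_card", 0, "Portrait", "png")
```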

form_fields_to_bigquery

form_fields_to_bigquery(
    dataset_name: str, table_name: str, project_id: typing.Optional[str] = None
) -> google.cloud.bigquery.job.load.LoadJob

Adds extracted form fields to a BigQuery table.

Parameters
Name	Description
`dataset_name`	`str` Required. Name of the BigQuery dataset.
`table_name`	`str` Required. Name of the BigQuery table.
`project_id`	`Optional[str]` Optional. Project ID containing the BigQuery table. If not passed, falls back to the default inferred from the environment.

Returns
Type	Description
`bigquery.job.LoadJob`	The BigQuery `LoadJob` for adding the form fields.

form_fields_to_dict

form_fields_to_dict() -> typing.Dict[str, typing.Union[str, typing.List[str]]]

Returns a dictionary of the form fields in the document.

Returns
Type	Description
`Dict[str, Union[str, List[str]]]`	The Dict of the form fields indexed by field name.

from_batch_process_metadata

from_batch_process_metadata(
    metadata: google.cloud.documentai_v1.types.document_processor_service.BatchProcessMetadata,
) -> typing.List[google.cloud.documentai_toolbox.wrappers.document.Document]

Loads Documents from Cloud Storage, using the output from BatchProcessMetadata.

.. code-block:: python

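    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document

    # `client` is a documentai.DocumentProcessorServiceClient and `request`
    # is a batch process request; both are created elsewhere.
    operation = client.batch_process_documents(request)
    # Block until the batch operation finishes; `timeout` is in seconds.
    operation.result(timeout=timeout)
    metadata = documentai.BatchProcessMetadata(operation.metadata)
    # Returns a list with one wrapped Document per input file.
    wrapped_documents = document.Document.from_batch_process_metadata(metadata)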

Parameter
Name	Description
`metadata`	`documentai.BatchProcessMetadata` Required. The operation metadata after a `batch_process_documents()` operation completes.

Returns
Type	Description
`List[Document]`	A list of wrapped documents from gcs. Each document corresponds to an input file.

from_batch_process_operation

from_batch_process_operation(
    location: str, operation_name: str, timeout: typing.Optional[float] = None
) -> typing.List[google.cloud.documentai_toolbox.wrappers.document.Document]

Loads Documents from Cloud Storage, using the operation name returned from batch_process_documents().

.. code-block:: python

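    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document

    # `client` and `request` are created elsewhere.
    operation = client.batch_process_documents(request)
    operation_name = operation.operation.name
    # `location` is deprecated but is still the first positional parameter,
    # so pass both arguments by keyword. "us" is an example value.
    wrapped_documents = document.Document.from_batch_process_operation(
        location="us", operation_name=operation_name
    )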

Parameters
Name	Description
`location`	`str` Deprecated. The location of the processor used for `batch_process_documents()`. Maintained for backwards compatibility; the signature defines no default, so a value must still be passed.
`operation_name`	`str` Required. The fully qualified operation name for a `batch_process_documents()` operation. Format: `projects/{project}/locations/{location}/operations/{operation}`
`timeout`	`Optional[float]` Optional. Default None. Time in seconds to wait for operation to complete. If None, will wait indefinitely.

Returns
Type	Description
`List[Document]`	A list of wrapped documents from gcs. Each document corresponds to an input file.
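Because the fully qualified operation name already contains the location, one way to supply the deprecated `location` argument is to read it from `operation_name`. The sketch below is an illustration under that assumption: `location_from_operation_name` and `load_from_operation` are hypothetical helpers, and the toolbox import is deferred so the parsing function works without the package installed.

```python
import re

# Matches projects/{project}/locations/{location}/operations/{operation}.
_OPERATION_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/operations/(?P<operation>[^/]+)$"
)


def location_from_operation_name(operation_name: str) -> str:
    """Extract {location} from a fully qualified operation name."""
    match = _OPERATION_NAME.match(operation_name)
    if match is None:
        raise ValueError(f"Unexpected operation name: {operation_name!r}")
    return match.group("location")


def load_from_operation(operation_name: str, timeout=None):
    """Load wrapped Documents, deriving `location` from the operation name."""
    # Deferred import: requires the google-cloud-documentai-toolbox package.
    from google.cloud.documentai_toolbox import document

    return document.Document.from_batch_process_operation(
        location=location_from_operation_name(operation_name),
        operation_name=operation_name,
        timeout=timeout,
    )
```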

from_document_path

from_document_path(
    document_path: str,
) -> google.cloud.documentai_toolbox.wrappers.document.Document

Loads a Document from a local document_path.

.. code-block:: python

    from google.cloud.documentai_toolbox import document

    document_path = "/path/to/local/file.json"
    wrapped_document = document.Document.from_document_path(document_path)

Parameter
Name	Description
`document_path`	`str` Required. The path to the `document.json` file or directory containing sharded `document.json` files.

Returns
Type	Description
`Document`	A document from local `document_path`.

from_documentai_document

from_documentai_document(
    documentai_document: google.cloud.documentai_v1.types.document.Document,
) -> google.cloud.documentai_toolbox.wrappers.document.Document

Loads a Document from an in-memory documentai_document.

.. code-block:: python

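    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document

    # `client` is a documentai.DocumentProcessorServiceClient and `request`
    # is a documentai.ProcessRequest; both are created elsewhere.
    documentai_document = client.process_document(request).document
    wrapped_document = document.Document.from_documentai_document(documentai_document)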

Parameter
Name	Description
`documentai_document`	`documentai.Document` Required. The `Document.proto` response.

Returns
Type	Description
`Document`	A wrapped document built from `documentai_document`.

from_gcs

from_gcs(
    gcs_bucket_name: str, gcs_prefix: str, gcs_input_uri: typing.Optional[str] = None
) -> google.cloud.documentai_toolbox.wrappers.document.Document

Loads a Document from a Cloud Storage directory.

Parameters
Name	Description
`gcs_bucket_name`	`str` Required. The gcs bucket. Format: Given `gs://{bucket_name}/{optional_folder}/{target_folder}/` where `gcs_bucket_name={bucket_name}`.
`gcs_prefix`	`str` Required. The prefix to the location of the target folder. Format: Given `gs://{bucket_name}/{optional_folder}/{target_folder}` where `gcs_prefix={optional_folder}/{target_folder}`.
`gcs_input_uri`	`Optional[str]` Optional. The gcs uri to the original input file. Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/{file_name}.pdf`

Returns
Type	Description
`Document`	A document from gcs.
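When the output location is only known as a full `gs://` folder URI, it has to be split into the `gcs_bucket_name` and `gcs_prefix` values described above. The sketch below shows one way to do that; `split_gcs_folder_uri` and `load_from_folder` are hypothetical helpers, not toolbox functions, and the toolbox import is deferred so the splitting logic runs without the package installed.

```python
from typing import Tuple


def split_gcs_folder_uri(folder_uri: str) -> Tuple[str, str]:
    """Split gs://{bucket_name}/{prefix}/ into (bucket_name, prefix)."""
    scheme = "gs://"
    if not folder_uri.startswith(scheme):
        raise ValueError(f"Not a Cloud Storage URI: {folder_uri!r}")
    bucket_name, _, prefix = folder_uri[len(scheme):].partition("/")
    if not bucket_name:
        raise ValueError(f"Missing bucket name: {folder_uri!r}")
    # The documented prefix has no leading or trailing slash.
    return bucket_name, prefix.strip("/")


def load_from_folder(folder_uri: str):
    """Load a wrapped Document from a Cloud Storage output folder."""
    # Deferred import: requires the google-cloud-documentai-toolbox package.
    from google.cloud.documentai_toolbox import document

    bucket_name, prefix = split_gcs_folder_uri(folder_uri)
    return document.Document.from_gcs(
        gcs_bucket_name=bucket_name, gcs_prefix=prefix
    )
```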

from_gcs_uri

from_gcs_uri(
    gcs_uri: str, gcs_input_uri: typing.Optional[str] = None
) -> google.cloud.documentai_toolbox.wrappers.document.Document

Loads a Document from a Cloud Storage uri.

Parameters
Name	Description
`gcs_uri`	`str` Required. The full GCS uri to a Document JSON file. Example: `gs://{bucket_name}/{optional_folder}/{target_file}.json`.
`gcs_input_uri`	`Optional[str]` Optional. The gcs uri to the original input file. Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/{file_name}.pdf`

Returns
Type	Description
`Document`	A document from gcs.

get_entity_by_type

get_entity_by_type(
    target_type: str,
) -> typing.List[google.cloud.documentai_toolbox.wrappers.entity.Entity]

Returns the list of Entities of target_type.

Parameter
Name	Description
`target_type`	`str` Required. Target entity type.

Returns
Type	Description
`List[Entity]`	A list of `Entity` matching `target_type`.

get_form_field_by_name

get_form_field_by_name(
    target_field: str,
) -> typing.List[google.cloud.documentai_toolbox.wrappers.page.FormField]

Returns the list of FormFields named target_field.

Parameter
Name	Description
`target_field`	`str` Required. Target field name.

Returns
Type	Description
`List[FormField]`	A list of `FormField` matching `target_field`.

search_pages

search_pages(
    target_string: typing.Optional[str] = None, pattern: typing.Optional[str] = None
) -> typing.List[google.cloud.documentai_toolbox.wrappers.page.Page]

Returns the list of Pages containing target_string or text matching pattern.

Parameters
Name	Description
`target_string`	`Optional[str]` Optional. The exact string to search for in each page.
`pattern`	`Optional[str]` Optional. A regular expression to match against the text of each page.

Returns
Type	Description
`List[Page]`	A list of Pages.
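The difference between the two search modes can be illustrated on plain strings. This is a sketch of the idea only, not the toolbox implementation: `matching_page_indexes` is a hypothetical function that works on a list of page texts, and its requirement that exactly one of the two arguments is given is an assumption of this sketch.

```python
import re
from typing import List, Optional, Sequence


def matching_page_indexes(
    page_texts: Sequence[str],
    target_string: Optional[str] = None,
    pattern: Optional[str] = None,
) -> List[int]:
    """Return indexes of pages containing target_string or matching pattern."""
    # Assumption of this sketch: exactly one search mode is selected.
    if (target_string is None) == (pattern is None):
        raise ValueError("Pass exactly one of target_string or pattern.")
    if target_string is not None:
        # Literal substring search.
        return [i for i, text in enumerate(page_texts) if target_string in text]
    # Regular expression search anywhere in the page text.
    regex = re.compile(pattern)
    return [i for i, text in enumerate(page_texts) if regex.search(text)]


pages = ["Invoice 001\nTotal: $10", "Terms and conditions", "Invoice 002"]
by_string = matching_page_indexes(pages, target_string="Invoice")
by_pattern = matching_page_indexes(pages, pattern=r"\$\d+")
```

A literal `target_string` suits fixed labels such as a form title, while `pattern` suits values with a known shape, such as amounts or dates.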

split_pdf

split_pdf(pdf_path: str, output_path: str) -> typing.List[str]

Splits local PDF file into multiple PDF files based on output from a Splitter processor.

Parameters
Name	Description
`pdf_path`	`str` Required. The path to the PDF file.
`output_path`	`str` Required. The path to the output directory.

Returns
Type	Description
`List[str]`	A list of the output PDF files.

to_merged_documentai_document

to_merged_documentai_document() -> (
    google.cloud.documentai_v1.types.document.Document
)

Exports a documentai.Document from the wrapped document with shards merged.

Returns
Type	Description
`documentai.Document`	Document with all shards merged and text offsets applied.