Document(
shards: List[google.cloud.documentai_v1.types.document.Document],
gcs_bucket_name: Optional[str] = None,
gcs_prefix: Optional[str] = None,
gcs_input_uri: Optional[str] = None,
)
Represents a wrapped Document.
This class hides away the complexities of using the Document protobuf response output by the BatchProcessDocuments or ProcessDocument methods and implements convenient methods for searching and extracting information within the Document.
Attributes | |
---|---|
Name | Description |
gcs_bucket_name |
Optional[str]
Optional. The name of the gcs bucket. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_bucket_name=bucket . |
gcs_prefix |
Optional[str]
Optional. The prefix of the json files in the target_folder. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_prefix={optional_folder}/{target_folder} .
For more information please take a look at https://cloud.google.com/storage/docs/json_api/v1/objects/list .
|
pages |
List[Page]
A list of Pages in the Document. |
entities |
List[Entity]
A list of Entities in the Document. |
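The relationship between a full gs:// URI and the gcs_bucket_name and gcs_prefix values can be illustrated with a small standard-library sketch. The helper name split_gcs_uri is hypothetical and is not part of the toolbox; it also assumes the prefix is written without a trailing slash, as in the format shown above.

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Hypothetical helper: split a gs:// URI into (bucket name, prefix)."""
    scheme = "gs://"
    if not gcs_uri.startswith(scheme):
        raise ValueError(f"Expected a gs:// URI, got: {gcs_uri!r}")
    # Everything up to the first "/" after the scheme is the bucket name.
    bucket_name, _, prefix = gcs_uri[len(scheme):].partition("/")
    # Assumption: the prefix is stored without a trailing slash.
    return bucket_name, prefix.rstrip("/")


bucket, prefix = split_gcs_uri("gs://my-bucket/optional_folder/target_folder/")
```

Under these assumptions, the two returned values are the kind of strings that would be passed as gcs_bucket_name and gcs_prefix.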
Methods
convert_document_to_annotate_file_response
convert_document_to_annotate_file_response()
Converts OCR data from Document.proto to AnnotateFileResponse.proto for the Vision API.
Returns | |
---|---|
Type | Description |
AnnotateFileResponse | Proto with TextAnnotations. |
entities_to_bigquery
entities_to_bigquery(
dataset_name: str, table_name: str, project_id: Optional[str] = None
)
Adds extracted entities to a BigQuery table.
Parameters | |
---|---|
Name | Description |
dataset_name |
str
Required. Name of the BigQuery dataset. |
table_name |
str
Required. Name of the BigQuery table. |
project_id |
Optional[str]
Optional. Project ID containing the BigQuery table. If not passed, falls back to the default inferred from the environment. |
Returns | |
---|---|
Type | Description |
bigquery.job.LoadJob | The BigQuery LoadJob for adding the entities. |
entities_to_dict
entities_to_dict()
Returns a dictionary of the entities in the document.
Returns | |
---|---|
Type | Description |
Dict | The Dict of the entities indexed by type. |
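As a rough illustration of what "indexed by type" can mean, the sketch below groups stand-in entities under their type. SimpleEntity and index_entities_by_type are hypothetical names, not toolbox classes or functions, and collecting repeated types into a list is an assumption; the dictionary returned by the real method may be shaped differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SimpleEntity:
    """Stand-in for a wrapped entity; not the toolbox's Entity class."""

    type_: str
    mention_text: str


def index_entities_by_type(entities: List[SimpleEntity]) -> Dict[str, List[str]]:
    """Group mention texts under their entity type."""
    by_type: Dict[str, List[str]] = {}
    for entity in entities:
        # Assumption: repeated types are collected into a list.
        by_type.setdefault(entity.type_, []).append(entity.mention_text)
    return by_type


entities = [
    SimpleEntity("invoice_id", "INV-001"),
    SimpleEntity("line_item", "Widget"),
    SimpleEntity("line_item", "Gadget"),
]
entities_by_type = index_entities_by_type(entities)
```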
export_images
export_images(
output_path: str, output_file_prefix: str, output_file_extension: str
)
Exports images from the Document to files.
Parameters | |
---|---|
Name | Description |
output_path |
str
Required. The path to the output directory. |
output_file_prefix |
str
Required. The output file name prefix. |
output_file_extension |
str
Required. The output file extension. Format: |
Returns | |
---|---|
Type | Description |
List[str] | A list of output image file names. Format: {output_path}/{output_file_prefix}_{index}_{Entity.type_}.{output_file_extension} |
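The output file name format in the Returns table can be reproduced with a short sketch. The function image_file_name is hypothetical and only mirrors the documented pattern; it assumes the extension is passed without a leading dot, which the "." in the pattern suggests but the text does not state.

```python
def image_file_name(
    output_path: str,
    output_file_prefix: str,
    index: int,
    entity_type: str,
    output_file_extension: str,
) -> str:
    """Hypothetical helper: build a name following the documented pattern."""
    # Pattern from the Returns table:
    # {output_path}/{output_file_prefix}_{index}_{Entity.type_}.{output_file_extension}
    return (
        f"{output_path}/{output_file_prefix}_{index}_{entity_type}"
        f".{output_file_extension}"
    )


example_name = image_file_name("out", "id_card", 0, "Portrait", "png")
```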
form_fields_to_bigquery
form_fields_to_bigquery(
dataset_name: str, table_name: str, project_id: Optional[str] = None
)
Adds extracted form fields to a BigQuery table.
Parameters | |
---|---|
Name | Description |
dataset_name |
str
Required. Name of the BigQuery dataset. |
table_name |
str
Required. Name of the BigQuery table. |
project_id |
Optional[str]
Optional. Project ID containing the BigQuery table. If not passed, falls back to the default inferred from the environment. |
Returns | |
---|---|
Type | Description |
bigquery.job.LoadJob | The BigQuery LoadJob for adding the form fields. |
form_fields_to_dict
form_fields_to_dict()
Returns a dictionary of the form fields in the document.
Returns | |
---|---|
Type | Description |
Dict | The Dict of the form fields indexed by type. |
from_batch_process_metadata
from_batch_process_metadata(
metadata: google.cloud.documentai_v1.types.document_processor_service.BatchProcessMetadata,
)
Loads Documents from Cloud Storage, using the output from BatchProcessMetadata.
.. code-block:: python

    from google.cloud import documentai
    # `client`, `request`, and `timeout` are assumed to be defined by the caller.
    operation = client.batch_process_documents(request)
    # Block until the batch operation finishes.
    operation.result(timeout=timeout)
    metadata = documentai.BatchProcessMetadata(operation.metadata)

Parameter | |
---|---|
Name | Description |
metadata |
documentai.BatchProcessMetadata
Required. The operation metadata after a |
Returns | |
---|---|
Type | Description |
List[Document] | A list of wrapped documents from gcs. Each document corresponds to an input file. |
from_batch_process_operation
from_batch_process_operation(location: str, operation_name: str)
Loads Documents from Cloud Storage, using the operation name returned from batch_process_documents().
.. code-block:: python

    from google.cloud import documentai
    # `client` and `request` are assumed to be defined by the caller.
    operation = client.batch_process_documents(request)
    # The fully qualified name of the long-running operation.
    operation_name = operation.operation.name

Parameters | |
---|---|
Name | Description |
location |
str
Required. The location of the processor used for |
operation_name |
str
Required. The fully qualified operation name for a |
Returns | |
---|---|
Type | Description |
List[Document] | A list of wrapped documents from gcs. Each document corresponds to an input file. |
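The location and operation_name parameters are related: a fully qualified operation name normally embeds the location. The sketch below assumes the common Google Cloud layout projects/{project}/locations/{location}/operations/{id}; parse_operation_name is a hypothetical helper, not a toolbox function, and the sample name is made up for illustration.

```python
import re
from typing import Dict

# Assumed layout of a fully qualified long-running operation name.
_OPERATION_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/operations/(?P<operation_id>[^/]+)$"
)


def parse_operation_name(operation_name: str) -> Dict[str, str]:
    """Hypothetical helper: split an operation name into its components."""
    match = _OPERATION_NAME.match(operation_name)
    if match is None:
        raise ValueError(f"Unexpected operation name: {operation_name!r}")
    return match.groupdict()


parts = parse_operation_name("projects/123456/locations/us/operations/7890")
```

A check like this can catch a mismatch between the location argument and the location embedded in the operation name before any request is made.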
from_document_path
from_document_path(document_path: str)
Loads Document from local document_path.
.. code-block:: python

    from google.cloud.documentai_toolbox import document
    document_path = "/path/to/local/file.json"
    wrapped_document = document.Document.from_document_path(document_path)

Parameter | |
---|---|
Name | Description |
document_path |
str
Required. The path to the document.json file. |
Returns | |
---|---|
Type | Description |
Document | A document from local document_path. |
from_documentai_document
from_documentai_document(
documentai_document: google.cloud.documentai_v1.types.document.Document,
)
Loads Document from local documentai_document.
.. code-block:: python

    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document
    # `client` and `request` are assumed to be defined by the caller.
    documentai_document = client.process_document(request).document
    wrapped_document = document.Document.from_documentai_document(documentai_document)

Parameter | |
---|---|
Name | Description |
documentai_document |
documentai.Document
Required. The Document.proto response. |
Returns | |
---|---|
Type | Description |
Document | A document from local documentai_document. |
from_gcs
from_gcs(
gcs_bucket_name: str, gcs_prefix: str, gcs_input_uri: Optional[str] = None
)
Loads Document from Cloud Storage.
Parameters | |
---|---|
Name | Description |
gcs_bucket_name |
str
Required. The gcs bucket. Format: Given |
gcs_prefix |
str
Required. The prefix to the location of the target folder. Format: Given |
gcs_input_uri |
Optional[str]
Optional. The gcs uri to the original input file. Format: |
Returns | |
---|---|
Type | Description |
Document | A document from gcs. |
get_entity_by_type
get_entity_by_type(target_type: str)
Returns the list of Entities of target_type.
Parameter | |
---|---|
Name | Description |
target_type |
str
Required. The entity type to match. |
Returns | |
---|---|
Type | Description |
List[Entity] | A list of Entity matching target_type. |
get_form_field_by_name
get_form_field_by_name(target_field: str)
Returns the list of FormFields named target_field.
Parameter | |
---|---|
Name | Description |
target_field |
str
Required. Target field name. |
Returns | |
---|---|
Type | Description |
List[FormField] | A list of FormField matching target_field. |
search_pages
search_pages(target_string: Optional[str] = None, pattern: Optional[str] = None)
Returns the list of Pages containing target_string or text matching pattern.
Parameters | |
---|---|
Name | Description |
target_string |
Optional[str]
Optional. The exact string to search for. |
pattern |
Optional[str]
Optional. A regular expression to match against the page text. |
Returns | |
---|---|
Type | Description |
List[Page] | A list of Pages. |
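The difference between the two search modes can be sketched over plain strings standing in for page text. The function search_page_texts is hypothetical and returns page indexes rather than Page objects; requiring exactly one of the two arguments is an assumption about the intended usage, not documented behavior.

```python
import re
from typing import List, Optional


def search_page_texts(
    page_texts: List[str],
    target_string: Optional[str] = None,
    pattern: Optional[str] = None,
) -> List[int]:
    """Hypothetical helper: return indexes of pages whose text matches."""
    # Assumption: exactly one of the two arguments is expected.
    if (target_string is None) == (pattern is None):
        raise ValueError("Pass exactly one of target_string or pattern.")
    if target_string is not None:
        # Exact substring search.
        return [i for i, text in enumerate(page_texts) if target_string in text]
    # Regular expression search anywhere in the page text.
    regex = re.compile(pattern)
    return [i for i, text in enumerate(page_texts) if regex.search(text)]


pages = ["Invoice 2041 for ACME", "Terms and conditions", "Invoice 7 appendix"]
exact_hits = search_page_texts(pages, target_string="ACME")
regex_hits = search_page_texts(pages, pattern=r"Invoice \d{4}")
```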
split_pdf
split_pdf(pdf_path: str, output_path: str)
Splits local PDF file into multiple PDF files based on output from a Splitter/Classifier processor.
Parameters | |
---|---|
Name | Description |
pdf_path |
str
Required. The path to the PDF file. |
output_path |
str
Required. The path to the output directory. |
Returns | |
---|---|
Type | Description |
List[str] | A list of output PDF files. |