
The text fields for the Table elements contain the raw text of the table, and the text_as_html field contains the corresponding HTML representation of the table. However, you might also want the table’s information output as an invoice field in which, among other details, each of the invoice’s line items has a description, quantity, price, and total field. Neither the default Unstructured text field nor the text_as_html field presents tables in this way.
By using the document data extractor in your Unstructured workflows, you could have Unstructured extract the invoice’s data in a format similar to the following (ellipses indicate omitted fields for brevity):
The first document element, of type DocumentData, has an extracted_data field within metadata that contains a representation of the document’s data in the format that you specify. Beginning with the second document element and continuing until the end of the document, Unstructured also outputs the document’s data as a series of Unstructured’s default document elements and metadata as it normally would.
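As a rough illustration of that layout, the following Python sketch reads the extracted data from the first element and treats the remaining elements as ordinary document elements. The sample output here is hypothetical and heavily abbreviated: the element types other than DocumentData, the text values, and the numbers are placeholders, not real extractor results.

```python
import json

# Hypothetical, abbreviated output. All values are placeholders for illustration.
sample_output = json.loads("""
[
  {
    "type": "DocumentData",
    "metadata": {
      "extracted_data": {
        "invoice": {
          "items": [
            {"description": "Example item", "quantity": 2, "price": 5.0, "total": 10.0}
          ]
        }
      }
    }
  },
  {"type": "Title", "text": "INVOICE", "metadata": {}}
]
""")

# The first element carries the extracted data; the rest are default elements.
first, rest = sample_output[0], sample_output[1:]
extracted = first["metadata"]["extracted_data"]
line_items = extracted["invoice"]["items"]

for item in line_items:
    print(item["description"], item["quantity"], item["price"], item["total"])
```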
To use the document data extractor, in addition to your source documents you must provide an extraction guidance prompt and an extraction schema.
An extraction guidance prompt is like a prompt that you would give to a RAG chatbot. This prompt guides Unstructured on how to extract the data from the source documents. For this invoice example, the
prompt might look like the following:
- The top-level invoice object contains nested strings, arrays, and objects such as invoice_no, invoice_date, payment_due, bill_to, payment_information, terms_conditions, notes, items, subtotal, vat, and total.
- The nested payment_information object contains nested strings such as account_name, bank_name, and account_no.
- The nested items array contains a series of strings, integers, and numbers such as description, quantity, price, and total.
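The structure described above could be sketched as follows, assuming the extraction schema is written in JSON Schema style. The exact type given to each field is an assumption (for example, bill_to as a string, quantity as an integer, and price and total within items as numbers), and the helper string_fields is hypothetical. The sketch builds the schema as a Python dictionary and serializes it to JSON text.

```python
import json


def string_fields(*names):
    """Hypothetical helper: map each field name to a string-typed property."""
    return {name: {"type": "string"} for name in names}


# One line item. These type choices are assumptions, not confirmed by the source.
item_schema = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},
        "quantity": {"type": "integer"},
        "price": {"type": "number"},
        "total": {"type": "number"},
    },
}

invoice_schema = {
    "type": "object",
    "properties": {
        # Top-level fields assumed to be plain strings.
        **string_fields(
            "invoice_no", "invoice_date", "payment_due", "bill_to",
            "terms_conditions", "notes", "subtotal", "vat", "total",
        ),
        # Nested object of strings.
        "payment_information": {
            "type": "object",
            "properties": string_fields("account_name", "bank_name", "account_no"),
        },
        # Nested array of line items.
        "items": {"type": "array", "items": item_schema},
    },
}

extraction_schema = {"type": "object", "properties": {"invoice": invoice_schema}}

# JSON text that could be adapted as a starting point for a schema.
schema_text = json.dumps(extraction_schema, indent=2)
print(schema_text)
```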
Using the document data extractor
- Add a Document Data Extractor node to your existing Unstructured workflow. This node must be added immediately after the Partitioner node in the workflow. To add this node, in the workflow designer, click the + (add node) button, click Transform, and then click Document Data Extractor.
- Click the newly added Document Data Extractor node to select it.
- In the node’s settings pane, on the Details tab, specify the following:
  a. For Extraction Guidance Prompt, enter the text of your extraction guidance prompt.
  b. Click Edit Code, enter the text of your extraction schema, and then click Save Changes. The text you entered will appear in the Schema box.
- Continue building your workflow as desired.
- To see the results of the document data extractor, do one of the following:
  - If you are using a local file as input to your workflow, click Test immediately above the Source node. The results will be displayed on-screen in the Test output pane.
  - If you are using source and destination connectors for your workflow, run the workflow, monitor the workflow’s job, and then examine the results in your destination location.
Limitations
The document data extractor does not work with the Pinecone destination connector. This is because Pinecone enforces strict limits on the amount of metadata that it can manage, and these limits are below what the document data extractor typically needs for the metadata that it produces.
Saving the extracted data separately
There might be cases where you want to save the contents of the extracted_data field separately from the rest of Unstructured’s JSON output. To do this, you could use a Python script such as the following. This script works with one or more Unstructured JSON output files that you already have stored on the same machine as this script. Before you run this script, do the following:
- To process all Unstructured JSON files within a directory, change None for input_dir to a string that contains the path to the directory. This can be a relative or absolute path.
- To process specific Unstructured JSON files within a directory or across multiple directories, change None for input_file to a string that contains a comma-separated list of filepaths on your local machine, for example "./input/2507.13305v1.pdf.json,./input2/table-multi-row-column-cells.pdf.json". These filepaths can be relative or absolute.
  If input_dir and input_file are both set to something other than None, then the input_dir setting takes precedence, and the input_file setting is ignored.
- For the output_dir parameter, specify a string that contains the path to the directory on your local machine that you want to send the extracted_data JSON to. If the specified directory does not exist at that location, the code will create the missing directory for you. This path can be relative or absolute.
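A minimal sketch of a script with these three settings follows. It assumes that each Unstructured JSON output file is a list of element objects and that extracted_data sits under an element’s metadata. The function names, the non-recursive *.json match, and the .extracted_data.json output suffix are hypothetical choices made for illustration.

```python
import json
from pathlib import Path


def find_extracted_data(elements):
    """Return every extracted_data payload found in a list of elements."""
    found = []
    for element in elements:
        metadata = element.get("metadata") or {}
        if "extracted_data" in metadata:
            found.append(metadata["extracted_data"])
    return found


def save_extracted_data(input_dir=None, input_file=None, output_dir="./extracted_data"):
    """Write each input file's extracted_data to its own JSON file in output_dir."""
    if input_dir is not None:
        # input_dir takes precedence; input_file is ignored in this case.
        paths = sorted(Path(input_dir).glob("*.json"))
    elif input_file is not None:
        # Comma-separated list of file paths.
        paths = [Path(p.strip()) for p in input_file.split(",") if p.strip()]
    else:
        paths = []

    written = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            elements = json.load(f)
        payloads = find_extracted_data(elements)
        if not payloads:
            continue  # Nothing to save for this file.
        out_dir = Path(output_dir)
        out_dir.mkdir(parents=True, exist_ok=True)  # Create the directory if missing.
        out_path = out_dir / f"{path.stem}.extracted_data.json"  # Hypothetical naming.
        data = payloads[0] if len(payloads) == 1 else payloads
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
        written.append(out_path)
    return written


if __name__ == "__main__":
    # Change None for input_dir or input_file before running.
    written = save_extracted_data(input_dir=None, input_file=None,
                                  output_dir="./extracted_data")
    print(f"Wrote {len(written)} file(s).")
```

With both input_dir and input_file left as None, this sketch writes nothing. Setting input_dir to a directory path processes every .json file directly inside that directory.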
Additional examples
In addition to the preceding invoice example, here are some more examples that you can adapt for your own use.
Caring for houseplants
Using the following image file:
Medical invoicing
Using the following PDF file: