Extraction¶
The extraction step is the last step of the document parsing flow.
- It consists of extracting structured information from the document content.
- The expected structured information is defined in the document parsing configuration for each category.
How it works¶
- The extraction uses a pydantic model to validate the extracted information.
- The validation include basic JSON schema validation (similar to the one done in the manual parsing tool).
- The extraction is validated using pydantic models for advanced validation that includes basic JSON schema validation done in the manual parsing tool.
- It can go further by implementing safeguards to ensure that the output of the parser is valid.
Dynamic LLM Extractor¶
The dynamic LLM extractor is an LLM extractor using a dynamic few-shot prompting approach. When a document arrives for extraction (alongside its transcription and classification), we prepare a prompt for the LLM that includes:
- Extraction instructions based on the document's classification
- Few-shot examples of similar documents and their validated extraction outputs, retrieved from our reference dataset document pool using text similarity search on the transcriptions (using document embeddings)
- The new document's Markdown transcription as the final user input
flowchart TB
subgraph extraction ["Extraction (new document)"]
direction TB
A(New document arrives with transcription + classification) --> B(Prepare extraction instructions)
A --> C(Compute embedding of transcription)
C --> D(Query pgvector for N most similar documents by L2 distance)
D --> E(For each similar doc: fetch markdown transcription + validated extraction output)
E --> F(Build prompt)
B --> F
end
subgraph prompt ["LLM prompt structure"]
direction TB
P1["SYSTEM: extraction instructions"]
P2["SYSTEM (example_user): similar doc 1 transcription"]
P3["SYSTEM (example_assistant): similar doc 1 extraction (JSON)"]
P4["... more few-shot examples ..."]
P5["USER: new document transcription"]
P1 --- P2 --- P3 --- P4 --- P5
end
SE -.->|grows the example pool| D
F --> prompt
prompt --> G(Parse LLM response as structured extraction)
G -.->|validated extraction feeds back into pool| V
subgraph indexing ["Indexing (after validation)"]
direction LR
V(Document validated) --> CE(Compute embedding of transcription)
CE --> SE(Store embedding + metadata in PostgreSQL with pgvector)
end
Hold "Alt" / "Option" to enable pan & zoom