Extraction

The extraction step is the last step of the document parsing flow.

It consists of extracting structured information from the document content.
The expected structured information is defined in the document parsing configuration for each category.

How it works

The extracted information is validated against a pydantic model.
This validation includes the basic JSON schema validation (similar to the one done in the manual parsing tool).
It can go further by implementing safeguards that ensure the output of the parser is valid.
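
As an illustration, the validation step could look like the minimal sketch below. The `InvoiceFields` model, its field names, and the `validate_extraction` helper are hypothetical and are not taken from the actual parsing configuration; the field constraints stand in for the kind of safeguards mentioned above.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class InvoiceFields(BaseModel):
    """Hypothetical schema for one document category (an invoice)."""

    # Type checks correspond to the basic JSON schema validation.
    invoice_number: str = Field(..., min_length=1)
    # Constraints act as safeguards beyond plain type checks.
    total_amount: float = Field(..., ge=0)
    currency: str = Field(..., min_length=3, max_length=3)
    due_date: Optional[str] = None


def validate_extraction(raw: dict) -> Optional[InvoiceFields]:
    """Return the validated model, or None if the raw output is invalid."""
    try:
        return InvoiceFields(**raw)
    except ValidationError:
        # A real pipeline might log the errors or retry the extraction.
        return None
```

With this sketch, an output with a negative `total_amount` or a missing `invoice_number` is rejected instead of being passed downstream.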

Dynamic LLM Extractor

The dynamic LLM extractor uses an LLM to extract structured information from a document. The LLM is provided with:

The extraction instruction.
The document transcription.
A set of similar examples, each with its input (transcription) and expected output (structured information). This set of similar examples is computed using document embeddings.
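
The three inputs above can be sketched as follows. This is only an illustration under stated assumptions: the `Example` class, the `select_similar` and `build_prompt` helpers, the use of cosine similarity, and the prompt layout are all hypothetical, and the embeddings are assumed to be precomputed lists of floats.

```python
import json
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """Hypothetical stored example: input, expected output, and embedding."""

    transcription: str
    expected_output: dict
    embedding: List[float]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_similar(query: List[float], examples: List[Example], k: int = 2) -> List[Example]:
    """Return the k examples whose embeddings are closest to the query."""
    ranked = sorted(
        examples,
        key=lambda ex: cosine_similarity(query, ex.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instruction: str, transcription: str, examples: List[Example]) -> str:
    """Assemble instruction, few-shot examples, and the document to parse."""
    parts = [instruction, ""]
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i} input:",
            ex.transcription,
            f"Example {i} output:",
            json.dumps(ex.expected_output),
            "",
        ]
    parts += ["Document:", transcription, "Output:"]
    return "\n".join(parts)
```

Because the examples are selected per document, the prompt changes with each input, which is presumably what makes the extractor "dynamic".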