Extraction
The extraction step is the last step of the document parsing flow.
- It consists of extracting structured information from the document content.
- The expected structured information is defined in the document parsing configuration for each category.
How it works
- The extraction uses a pydantic model to validate the extracted information.
- The validation includes basic JSON schema validation (similar to the one done in the manual parsing tool).
- It can go further by implementing safeguards to ensure that the output of the parser is valid.
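As an illustration of the validation described above, here is a minimal sketch assuming pydantic v2. The `InvoiceFields` model, its field names, and the `validate_extraction` helper are hypothetical and are not taken from the actual parsing configuration. The field constraints play the role of the basic schema validation, and the custom validator shows one possible safeguard.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, field_validator


class InvoiceFields(BaseModel):
    """Hypothetical target schema for an 'invoice' document category."""

    # Basic schema-level constraints (types, minimum length, value range).
    invoice_number: str = Field(min_length=1)
    total_amount: float = Field(ge=0)
    issue_date: Optional[date] = None

    @field_validator("invoice_number")
    @classmethod
    def reject_blank_number(cls, value: str) -> str:
        # Safeguard beyond the schema: a whitespace-only identifier is invalid.
        cleaned = value.strip()
        if not cleaned:
            raise ValueError("invoice_number must not be blank")
        return cleaned


def validate_extraction(raw: dict) -> tuple[Optional[InvoiceFields], list[str]]:
    """Return the validated model, or None together with readable error messages."""
    try:
        return InvoiceFields.model_validate(raw), []
    except ValidationError as exc:
        messages = [
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        ]
        return None, messages
```

Calling `validate_extraction` on the raw dictionary returned by the parser either yields a typed object or a list of messages that can be logged or used to reject the output.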
Dynamic LLM Extractor
The dynamic LLM extractor is an extractor that uses an LLM to extract structured information from a document. It provides the LLM with:
- The extraction instruction.
- The document transcription.
- A set of similar examples, each with its input (transcription) and expected output (structured information). This set of similar examples is computed using document embeddings.
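The three inputs listed above can be assembled roughly as follows. This is a sketch using only the standard library, under several assumptions: the embeddings are already computed, similarity is measured with cosine similarity, and the prompt layout is arbitrary. The names `Example`, `select_similar`, and `build_prompt` are hypothetical, and the call to the LLM itself is left out.

```python
import json
import math
from dataclasses import dataclass


@dataclass
class Example:
    """A stored example: input transcription, expected output, and its embedding."""

    transcription: str
    expected_output: dict
    embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    return dot / norm if norm else 0.0


def select_similar(query: list[float], pool: list[Example], k: int = 2) -> list[Example]:
    """Return the k examples whose embeddings are closest to the query embedding."""
    ranked = sorted(pool, key=lambda ex: cosine_similarity(query, ex.embedding), reverse=True)
    return ranked[:k]


def build_prompt(instruction: str, transcription: str, examples: list[Example]) -> str:
    """Concatenate the instruction, the selected examples, and the document to extract."""
    parts = [instruction, ""]
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i} input:",
            ex.transcription,
            f"Example {i} output:",
            json.dumps(ex.expected_output),
            "",
        ]
    parts += ["Document to extract:", transcription]
    return "\n".join(parts)
```

Because the examples are chosen per document, the prompt changes from one document to the next even though the instruction stays the same.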