Outperforming Industry-Standard Table Extraction: Building Our Own Data Pipeline
In the oil and gas industry, production data is everything. Daily production values form the foundation of well valuations and production forecasting. Get those numbers wrong or get them late, and your models fall apart.
We have access to roughly 180,000 pages per year of the most accurate production data available. However, it sits unused. The data exists only as tables in PDFs, and the unstructured format makes it impossible to feed directly into our system.
Currently, we rely on structured data, which comes with a three-month lag and lower accuracy. For decision-making, a three-month lag feels like a lifetime, and lower accuracy skews models and distorts valuations.
We built an in-house pipeline to extract, structure, and validate this data automatically. It outperforms industry-standard table extraction tools, and does so at a fraction of the cost of alternative solutions. Here's how we did it.
Why Existing Tools Failed
Our first instinct was to use existing tools. Surely someone had already solved PDF table extraction.
The simplest option was a rule-based system, but that was out of the question from the start. Every operator formatted their PDFs differently: different column counts, different column names, different table structures. Formats changed over time, sometimes without notice. A rigid, rule-based approach would require constant maintenance and would break with every new variation.
The next logical step was Optical Character Recognition (OCR)-based extraction, but it failed for different reasons. Values appeared visibly cut off or overlapping in the rendered tables. The embedded PDF data was technically correct, but the visual presentation was problematic. OCR models could not parse what they "saw," and even human readers struggled with the same issues. Open-source computer vision models like Microsoft's Table Transformer could detect tables reliably but struggled to extract accurate values from them.
We considered training custom models, including computer vision models, LLMs, and agentic systems, to extract table data specifically for our use case. However, we had no ground truth data to train on. The most accurate production data existed only in these PDFs, with no corresponding structured dataset to use as training labels.
We tried off-the-shelf solutions: Google Document AI, LandingAI's Agentic Document Extraction (built by Andrew Ng's team), and direct LLM ingestion of PDFs. The results were disappointing: merged columns, single tables detected as multiple overlapping tables, incorrect column assignments, and major extraction errors.
These failures pointed us toward a different approach: large language models that could handle ambiguity and semantic variation flexibly, without requiring explicit training data.
The Solution: A Multi-Stage Pipeline
Rather than relying on a single tool or approach, we built a multi-stage pipeline where each component handles a specific part of the extraction process.
Pipeline Stages

Stage 1: Table Detection We use Microsoft's open-source Table Transformer to detect table regions in each PDF and crop them for processing.
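For illustration, the detection step might look like the sketch below, using the Hugging Face transformers implementation of Table Transformer. The checkpoint name is the public detection model; the confidence threshold and cropping logic are illustrative assumptions, not our exact settings.

```python
# A minimal sketch of table detection with the open-source Table Transformer.
# Threshold and cropping are illustrative choices.
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

def detect_tables(page_image: Image.Image, threshold: float = 0.7) -> list[Image.Image]:
    """Return cropped table regions from a rendered PDF page."""
    inputs = processor(images=page_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([page_image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    # Each box is (x0, y0, x1, y1) in image coordinates; crop for the next stage.
    return [page_image.crop(tuple(box.tolist())) for box in results["boxes"]]
```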
Stage 2: Text Extraction We extract text from the cropped table regions using Fitz (PyMuPDF). We initially tried multiple PDF processing packages, such as pdfplumber, but Fitz proved orders of magnitude faster for our volume of 180k pages per year.
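A minimal sketch of this step with PyMuPDF is shown below; the bounding box is a placeholder for the coordinates produced by the detection stage.

```python
# A minimal sketch of extracting embedded text from a cropped table region
# with PyMuPDF (no OCR involved).
import fitz  # PyMuPDF

def extract_table_text(pdf_path: str, page_number: int,
                       bbox: tuple[float, float, float, float]) -> str:
    """Extract the text inside a table bounding box on one page."""
    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        clip = fitz.Rect(*bbox)  # (x0, y0, x1, y1) in PDF coordinates
        return page.get_text("text", clip=clip, sort=True)
```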
Stage 3: Preprocessing We perform row segmentation and format the extracted text for LLM ingestion, ensuring the data is structured in a way the model can interpret reliably.
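Row segmentation can be sketched as grouping extracted words by their vertical position; the tolerance and the column separator below are illustrative assumptions.

```python
# A minimal sketch of row segmentation: group words that share roughly the
# same baseline into one row of text before formatting for the LLM.
import fitz

def segment_rows(page: fitz.Page, clip: fitz.Rect, y_tolerance: float = 3.0) -> list[str]:
    """Group words into rows by vertical position, left to right within each row."""
    words = page.get_text("words", clip=clip)  # (x0, y0, x1, y1, word, block, line, word_no)
    words.sort(key=lambda w: (w[1], w[0]))     # top-to-bottom, then left-to-right
    rows, current, current_y = [], [], None
    for x0, y0, x1, y1, word, *_ in words:
        if current_y is None or abs(y0 - current_y) <= y_tolerance:
            current.append(word)
            current_y = y0 if current_y is None else current_y
        else:
            rows.append(" | ".join(current))
            current, current_y = [word], y0
    if current:
        rows.append(" | ".join(current))
    return rows
```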
Stage 4: Prompt Construction We generate dynamic, operator-aware prompts using an external configuration file. This configuration defines column descriptions, column aliases, which columns to add or remove, and which to ignore. This flexibility allows the pipeline to handle format variations across operators without code changes.
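The sketch below shows the general idea: a per-operator configuration drives the prompt, so a new format means a config change rather than a code change. The configuration schema and field names are illustrative, not our actual schema.

```python
# A minimal sketch of operator-aware prompt construction from an external
# configuration file. All field names here are illustrative.
import json

def build_prompt(operator: str, table_rows: list[str],
                 config_path: str = "operators.json") -> str:
    with open(config_path) as f:
        config = json.load(f)[operator]

    column_lines = [
        f"- {col['name']} (aliases: {', '.join(col.get('aliases', []))}): {col['description']}"
        for col in config["columns"]
    ]
    ignore = ", ".join(config.get("ignore_columns", [])) or "none"

    return (
        "Extract the table below into CSV with exactly these columns:\n"
        + "\n".join(column_lines)
        + f"\nIgnore these columns if present: {ignore}\n\n"
        + "Table text:\n"
        + "\n".join(table_rows)
    )
```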
Stage 5: LLM Extraction Gemini 2.5 Pro converts the formatted table text into structured data, understanding semantic variations and mapping columns correctly despite inconsistent naming.
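A minimal sketch of this stage using the google-genai Python SDK is shown below. The SDK choice and client setup are illustrative assumptions; the essential point is that the prompt from stage 4 goes to Gemini 2.5 Pro and the raw text response is handed to post-processing.

```python
# A minimal sketch of the LLM extraction call via the google-genai SDK.
from google import genai

client = genai.Client()  # reads the API key from the environment

def extract_structured(prompt: str) -> str:
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=prompt,
    )
    return response.text  # expected CSV text, validated in later stages
```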
Stage 6: Post-Processing We ensure correct CSV output: consistent column counts, proper comma placement, and concatenation of pages (since the LLM processes page by page). We also handle any unexpected outputs from the model. If formatting errors cannot be systematically corrected, the affected portions are retried through stages 4-6.
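One of the structural checks can be sketched as follows: every row of the LLM's CSV output must match the header's column count, and any row that does not is flagged for a retry through stages 4-6.

```python
# A minimal sketch of a post-processing check: flag rows whose column count
# differs from the header.
import csv
import io

def check_column_counts(csv_text: str) -> list[int]:
    """Return indices of rows whose column count differs from the header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    expected = len(rows[0])
    return [i for i, row in enumerate(rows[1:], start=1) if len(row) != expected]
```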
Stage 7: Live Validation Every PDF undergoes real-time validation. We run both simple checks (API number formats, date consistency, no missing dates) and complex validations (comparing PDF-reported totals against our extracted structured output). Any validation failures trigger a retry through stages 4-7 for the specific pages, rows, or columns that failed.
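The simple checks are straightforward to illustrate. The sketch below covers two of them, API number format and gaps in the date series; the exact rules and input handling here are simplified assumptions.

```python
# A minimal sketch of two simple validations: API number format and missing
# dates in an otherwise daily series.
import re
from datetime import date, timedelta

API_PATTERN = re.compile(r"^(\d{10}|\d{12}|\d{14})$")

def invalid_api_numbers(api_numbers: list[str]) -> list[str]:
    """Return API numbers that are not 10, 12, or 14 digits (dashes ignored)."""
    return [api for api in api_numbers if not API_PATTERN.match(api.replace("-", ""))]

def find_date_gaps(dates: list[date]) -> list[date]:
    """Return calendar days missing from the reported time series."""
    ordered = sorted(set(dates))
    missing = []
    for prev, nxt in zip(ordered, ordered[1:]):
        day = prev + timedelta(days=1)
        while day < nxt:
            missing.append(day)
            day += timedelta(days=1)
    return missing
```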
Why Multi-Stage?
Breaking the pipeline into discrete stages provides several advantages. Each stage can be optimized independently. Debugging becomes straightforward since we can isolate failures to specific components. The embedded retry mechanism only reruns affected portions through the necessary stages rather than reprocessing entire PDFs, saving both time and cost.
The Validation Framework
Achieving high extraction accuracy was only part of the problem. We also needed to verify that accuracy automatically, at scale, without manual review of around 180k pages per year.
Our validation framework checks every extracted value against a set of rules and constraints. We manually verified the framework itself to ensure it was catching genuine errors. After that, it runs autonomously on every PDF without requiring human labels or ground truth data.
What Are Totals?
To understand how our most important validation works, we first need to explain the structure of these operator-reported PDFs.
All PDFs include summary rows or footer sections that report totals. These totals aggregate production values by API number (a unique well identifier) or specific time intervals. For example, a PDF might report that well API 42-123-45678 produced 1,500 barrels of oil across a specific date range, with daily breakdowns listed in the previous rows.
These totals gave us a built-in validation mechanism. If our extracted daily values for a given API and time period do not sum to the reported total, something went wrong.
How Validation Works
We run two types of checks on every PDF:
- Simple validations catch formatting and structural errors. API (well identifier) numbers must be 10, 12, or 14 digits. Dates must be consistent and complete with no gaps in the reported time series. These checks are fast and catch obvious extraction failures.
- Complex validations reconcile our extracted data against the PDF-reported totals. We sum our extracted production values by API and time interval, then compare those sums to the totals stated in the PDF. Any discrepancy flags a potential extraction error.
All validation failures are mapped to exact rows and pages. A failed check tells us precisely which page, row, and possibly which API and date range needs correction.
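The complex check can be sketched as a totals reconciliation in pandas. The column names and tolerance below are illustrative assumptions; the production version also groups by reporting interval and carries page and row locations through to the retry mechanism.

```python
# A minimal sketch of totals reconciliation: sum extracted daily values per
# API and compare against the PDF-reported totals.
import pandas as pd

def reconcile_totals(extracted: pd.DataFrame, reported_totals: pd.DataFrame,
                     tolerance: float = 1e-6) -> pd.DataFrame:
    """Return the APIs whose summed daily values do not match the reported total."""
    sums = extracted.groupby("api", as_index=False)["oil_bbl"].sum()
    merged = sums.merge(reported_totals, on="api", suffixes=("_extracted", "_reported"))
    merged["difference"] = (merged["oil_bbl_extracted"] - merged["oil_bbl_reported"]).abs()
    return merged[merged["difference"] > tolerance]  # rows that fail reconciliation
```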
Validation Outputs
The framework produces readable summaries for each PDF, designed specifically for non-technical stakeholders. These summaries classify each PDF as passed or failed and describe any issues in plain language without requiring knowledge of the underlying pipeline. A supervisor can quickly understand which PDFs need attention and why, without needing to interpret technical logs or error codes.
Behind these summaries, the framework maintains detailed technical logs that capture exact row numbers, page locations, failing values, and error types. These logs are essential for debugging and feed directly into the retry mechanism.
We also generate centralized manifests that aggregate validation results across all processed PDFs, providing a high-level view of system health and highlighting patterns in failures.
An Unexpected Benefit
The validation framework does more than verify our extraction accuracy. It also audits the original PDFs.
The framework flagged PDFs with incorrect operator-reported totals, missing dates, and inconsistent production values. These were errors in the source PDFs, not our extraction. Operators had submitted flawed reports, and our system caught them.
This turned the validation framework into a data quality tool that works in both directions: it validates our pipeline and the incoming PDFs themselves.
The Surgical Retry Mechanism
When the pipeline fails, we do not reprocess the entire PDF. Instead, we retry only the specific portions that failed.
This surgical approach is what allows us to achieve near-perfect accuracy without wasting compute or time.
When Retries Are Triggered
Retries are triggered by two types of failures:
- Post-processing failures occur when the LLM output cannot be formatted correctly. This might be inconsistent column counts, malformed CSV structure, or unexpected data types. When these failures happen, the affected pages are sent back through stages 4 through 6: prompt construction, LLM extraction, and post-processing.
- Performance-based validation failures occur when extracted data fails simple or complex checks. A missing date, an incorrect API format, or a total that does not reconcile triggers a retry. The affected pages, rows, or columns are sent back through stages 4 through 7: prompt construction, LLM extraction, post-processing, and validation.
How Retries Work
Each retry is localized to the exact failure point. If a single row on page 3 has a date formatting issue, only that row on that page is retried. The retry prompt is enriched with context. It includes the previous LLM output, the exact location of the failure (page number, row range, column name), and a detailed description of what went wrong. This gives the model everything it needs to correct the specific issue without starting from scratch.
For example, a retry prompt might resend the fourth page of the PDF to the LLM along with the previous output and a description of the problems:
- Rows 7-14: For 'API': '12-3456-7890' the 'Production Date': '2025-09-14' is missing.
- Rows 1-13: 'Oil Production' total should amount to '1234'; however, current value is '1000'.
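Assembling such a retry prompt from the failure context might look like the sketch below; the field names and wording are illustrative, not our exact template.

```python
# A minimal sketch of building a localized retry prompt from failure context.
def build_retry_prompt(page_number: int, previous_output: str,
                       failures: list[dict], page_text: str) -> str:
    issue_lines = [f"- Rows {f['rows']}: {f['description']}" for f in failures]
    return (
        f"You previously extracted page {page_number} of this PDF, but validation "
        "found the issues listed below. Correct ONLY the affected rows and return "
        "the full corrected CSV for this page.\n\n"
        "Previous output:\n" + previous_output + "\n\n"
        "Issues:\n" + "\n".join(issue_lines) + "\n\n"
        "Original table text:\n" + page_text
    )
```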
Why This Matters
Retries are expensive if done naively. Reprocessing an entire 100-page PDF because one row failed would multiply costs and latency by orders of magnitude at our scale of 180k pages per year.
Surgical retries keep costs low and accuracy high. Most PDFs require zero retries. Some require one or two targeted corrections. The system handles these automatically, mapping failures to exact locations and re-prompting only what needs fixing.
Results
The pipeline achieves 99.99% accuracy, defined as passing all cell-level validation checks. Most PDFs process successfully on the first pass. A small percentage require one or two surgical retries to correct specific rows or columns. The system handles these automatically without human intervention.
Where industry-standard tools like Google Document AI and LandingAI's Agentic Document Extraction struggled with merged columns, overlapping tables, and structural errors, our multi-stage approach consistently delivers accurate extraction. The 0.01% of extractions that "failed" validation had value differences of less than 0.0001%. These errors have no practical impact on forecasting or valuation models.
Beyond accurate extraction, the validation framework became an unexpected data quality tool. It flags errors in the original PDFs, catching incorrect operator-reported totals, missing dates, and inconsistent production values. This turns the system into a bidirectional audit: it validates our extraction and validates the source data itself.
Beyond Production Values: Extending the Pipeline
The modular design makes this pipeline adaptable to other unstructured PDF extraction problems. For a new use case, only three components need modification:
- Configuration file: Defines the column mapping (column names, aliases, descriptions) and PDF-specific instructions.
- Prompt Template: Uses the configuration file to provide domain-specific context to ensure correct LLM extraction.
- Validation/Evaluation: Can be as simple as cell-by-cell ground truth comparison or as complex as domain-specific rules like our totals reconciliation.
Everything else (table detection, text extraction, preprocessing, LLM extraction, post-processing, and the retry mechanism) remains unchanged.
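For the third component, the simplest option mentioned above, a cell-by-cell ground truth comparison, can be sketched in a few lines of pandas (assuming both tables share the same shape, column names, and row order):

```python
# A minimal sketch of cell-by-cell validation against a ground truth table.
import pandas as pd

def cell_mismatches(extracted: pd.DataFrame, ground_truth: pd.DataFrame) -> pd.DataFrame:
    """Return only the cells where extraction and ground truth disagree."""
    return extracted.compare(ground_truth)
```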
Conclusion
The key to making this work was combining LLM semantic understanding with traditional validation rules. The multi-stage design allowed us to optimize each component independently. Domain-specific validation gave us confidence in the output without requiring manual ground truth labels.