Two Approaches to B2B Document Extraction: Rules vs. Large Language Models

Introduction

Automating the extraction of structured data from B2B documents—such as purchase orders, invoices, and shipment confirmations—is a common pain point. While traditional rule-based methods have been the go‑to solution for decades, the emergence of large language models (LLMs) offers an alternative that promises greater flexibility. In this article, we compare two implementations of a document extractor built for a realistic B2B order scenario: one using a rule‑based system with pytesseract (an OCR engine) and the other using an LLM approach with Ollama and LLaMA 3. We examine accuracy, speed, cost, and maintainability to help you decide which path suits your use case.

Source: towardsdatascience.com

The Rule‑Based Approach

How It Works

The rule‑based extractor relies on pytesseract, a Python wrapper for Google’s Tesseract OCR engine. The process begins with image preprocessing (deskewing, binarisation, and noise removal) followed by OCR to convert the scanned PDF or image into raw text. Hand‑crafted regular expressions and positional heuristics then extract fields such as order number, date, line items, and totals. For example, a pattern like Order\s*#:\s*(\d+) captures the order number, while table boundaries are guessed based on horizontal lines or consistent spacing.
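The core of this pipeline can be sketched in a few lines. This is a minimal illustration, not the article's actual code: the field patterns (`ORDER_NO`, `ORDER_DATE`, `TOTAL`) and the `extract_fields` helper are hypothetical examples for a single vendor layout, and real rules would also need table-parsing heuristics for line items.

```python
import re

# Hypothetical hand-crafted patterns for one vendor's purchase-order layout.
ORDER_NO = re.compile(r"Order\s*#:\s*(\d+)")
ORDER_DATE = re.compile(r"Date:\s*(\d{2}/\d{2}/\d{4})")
TOTAL = re.compile(r"Total:\s*\$?([\d,]+\.\d{2})")

def extract_fields(raw_text: str) -> dict:
    """Apply regexes to OCR output; fields that do not match become None."""
    def first(pattern):
        m = pattern.search(raw_text)
        return m.group(1) if m else None
    return {
        "order_number": first(ORDER_NO),
        "date": first(ORDER_DATE),
        "total": first(TOTAL),
    }

# In production the raw text would come from the OCR step, e.g.:
#   import pytesseract
#   from PIL import Image
#   raw_text = pytesseract.image_to_string(Image.open("order.png"))
```

Note that the deterministic behaviour praised below falls directly out of this design: the same text always matches the same patterns.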

Strengths and Weaknesses

  • Speed: Once rules are in place, inference is fast—often under a second per page.
  • Cost: Only computational resources are required; no API fees.
  • Determinism: The same input always produces the same output, which is valuable for auditing.
  • Brittleness: Small layout changes (e.g., a new field or a different font) break the rules. Maintaining rules across dozens of vendors becomes a burden.
  • Limited Context: The system cannot infer meaning—it only matches patterns. A date field that appears in multiple ambiguous locations may be misinterpreted.

The LLM‑Based Approach

How It Works

The LLM‑based system uses Ollama to run LLaMA 3 locally. The scanned document is first processed by an OCR layer (the same pytesseract pipeline used by the rule‑based system) to extract all visible text, but instead of applying rules, the entire plain‑text output is fed into a prompt that instructs the model to return a structured JSON object containing the required fields. The prompt includes a few examples (few‑shot prompting) and describes the expected schema.
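A minimal sketch of that flow is below, using Ollama's local HTTP API (`POST /api/generate` with `"format": "json"` to constrain output to JSON). The schema, the one-example few-shot prompt, and the `extract_with_llm` helper are illustrative assumptions, not the article's exact prompt:

```python
import json
import urllib.request

# Hypothetical target schema for the extracted fields.
SCHEMA = {"order_number": "string", "date": "YYYY-MM-DD", "total": "number"}

def build_prompt(ocr_text: str) -> str:
    """Few-shot prompt asking the model to emit JSON matching the schema."""
    return (
        "Extract the following fields from the document text and reply with "
        f"JSON only, matching this schema: {json.dumps(SCHEMA)}\n\n"
        "Example:\nText: Order #: 7 Date: 2024-01-02 Total: $10.00\n"
        'Output: {"order_number": "7", "date": "2024-01-02", "total": 10.0}\n\n'
        f"Text: {ocr_text}\nOutput:"
    )

def extract_with_llm(ocr_text: str, model: str = "llama3") -> dict:
    """Send the prompt to a local Ollama server and parse its JSON reply."""
    payload = json.dumps({
        "model": model,
        "prompt": build_prompt(ocr_text),
        "format": "json",   # Ollama's JSON mode constrains the output format
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload, headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return json.loads(body["response"])
```

Because the model sees the whole document text, it can resolve field locations by meaning rather than position, which is the key difference from the regex approach.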

Strengths and Weaknesses

  • Flexibility: The model can interpret text variations, reorder fields, and even correct minor OCR errors. A vendor that changes its invoice layout does not require rule updates.
  • Contextual Understanding: The LLM grasps the semantic meaning of “Total Amount Due” even when it appears in an unexpected location.
  • Slower Inference: Running LLaMA 3 locally takes several seconds per page, and latency grows with document length.
  • Higher Resource Cost: Even a smaller model like LLaMA 3‑8B requires a modern GPU for reasonable speed. Larger variants demand significant memory.
  • Output Variability: Responses are probabilistic; occasional hallucinations or format deviations require post‑processing validation.
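The post-processing validation mentioned above can be as simple as whitelisting the expected keys and coercing types, so that hallucinated fields are dropped rather than stored. A sketch, assuming the same three-field schema as before (the `validate` helper is hypothetical):

```python
# Expected schema keys; anything else the model invents is discarded.
REQUIRED = {"order_number", "date", "total"}

def validate(record: dict) -> dict:
    """Keep only schema fields; missing or malformed values become None
    instead of propagating model-invented content downstream."""
    cleaned = {k: record.get(k) for k in REQUIRED}
    if cleaned["total"] is not None:
        try:
            cleaned["total"] = float(str(cleaned["total"]).replace(",", ""))
        except ValueError:
            cleaned["total"] = None  # non-numeric total: treat as missing
    return cleaned
```

This mirrors the rule system's behaviour of returning null for absent fields, which the accuracy comparison below notes is safer than a confident hallucination.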

Comparative Analysis

Accuracy

In our B2B order test set (which included 50 invoices from five different vendors with varying layouts), the rule‑based system achieved 92% field‑level accuracy, mostly failing on oddly placed line‑item tables. The LLM approach reached 97% accuracy, successfully handling variations like merged cells and missing headers. However, the LLM occasionally invented a field when the information was truly missing (a false positive), whereas the rule system simply returned null.


Speed and Throughput

For a single‑page order, the rule‑based extractor processed 100 documents in 40 seconds (0.4 s per doc). The LLM took 6 minutes and 20 seconds (3.8 s per doc) on an NVIDIA A10G GPU. On a CPU‑only machine, LLM inference was impractically slow (over 30 s per doc).
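Per-document figures like these are easy to reproduce with a small wall-clock harness; the `benchmark` helper here is an illustrative sketch, not the article's measurement code:

```python
import time

def benchmark(extract_fn, documents):
    """Run an extractor over a batch and return (results, seconds per doc)."""
    start = time.perf_counter()
    results = [extract_fn(doc) for doc in documents]
    elapsed = time.perf_counter() - start
    return results, elapsed / max(len(documents), 1)
```

For LLM inference, measuring over a realistic batch matters: latency grows with document length, so a single short sample understates production cost.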

Maintenance and Flexibility

Over a six‑month period, the rule‑based system required 12 manual updates to adapt to vendor template changes. The LLM‑based system required none—new layouts were handled without code changes. On the other hand, the LLM system needed occasional prompt tuning (e.g., adding examples for a new vendor’s abbreviation style).

Cost

  • Rule‑based: No API costs, but engineering time for rule maintenance adds up. Estimated annual effort: 1–2 person‑weeks.
  • LLM‑based: High upfront GPU cost ($2,000–$5,000) and GPU‑time for inference. Cloud GPU rental costs ~$0.50 per hour, translating to $0.0018 per page (vs. negligible for rules).

Practical Recommendations

Choose the rule‑based approach if:

  • Your documents follow predictable, standardized layouts (e.g., a single vendor).
  • Throughput and low latency are critical.
  • You can invest in upfront rule development and ongoing maintenance.

Choose the LLM‑based approach if:

  • You handle documents from many vendors with frequently changing layouts.
  • You have GPU resources available (on‑premises or cloud).
  • Accuracy is more important than speed, and you can tolerate a few seconds of latency per page.

For many teams, a hybrid approach works best: use rules for simple, high‑volume documents and fall back on an LLM for complex or unknown layouts. This balances cost, speed, and accuracy.
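The hybrid dispatch logic is straightforward to express. In this sketch (the `hybrid_extract` helper and its parameters are hypothetical), rules run first for known vendors, and the LLM handles unknown layouts or documents where the rules left required fields empty:

```python
def hybrid_extract(doc_text, known_vendor, rules_fn, llm_fn,
                   required=("order_number", "date", "total")):
    """Try fast deterministic rules first; fall back to the slower LLM when
    the vendor is unknown or the rules fail to fill every required field."""
    if known_vendor:
        result = rules_fn(doc_text)
        if all(result.get(f) is not None for f in required):
            return result  # cheap path succeeded
    return llm_fn(doc_text)  # expensive fallback
```

In practice this routes the high-volume, predictable documents through the sub-second rule path and reserves GPU time for the long tail of unusual layouts.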

Conclusion

Both pytesseract‑based rules and Ollama‑powered LLMs can successfully extract data from B2B documents, but they serve different needs. The rule system is fast, cheap, and deterministic, yet brittle. The LLM system is flexible and accurate but slower and more expensive. By understanding the trade‑offs described in this comparison, you can select the best tool—or combine both—for your document extraction pipeline.
