News | January 21, 2025

AI-Enabled Data Extraction from Structured and Unstructured Documents


I’m Kunal Bargotra, Product Manager at Adlib, and I’m excited to share some insights about one of the most important (and frankly, most frustrating) challenges facing businesses today: extracting critical data from documents.

Every organization deals with invoices, contracts, employee agreements, and other files that are packed with essential information. But getting to that information? That’s a different story. Some of it is buried in multi-line text, scattered across pages, or hidden behind different labels like "Bill To," "Ship To," or "Employer Name." If you’ve ever tried to automate data extraction from these kinds of documents, you know it’s not as simple as pointing to a field and saying, “That’s the one.”

With our Adlib Transform 2024.2 release, we’ve made it possible to extract data from both structured and unstructured documents by using our AiLink connector to Large Language Models (LLMs), and we’ve done it in a way that’s intuitive, flexible, and built for real-world business challenges.

I want to walk you through why unstructured data extraction is so difficult, how Adlib makes it easier, and what sets us apart.

The Challenge of Extracting Data from Unstructured Documents

If you’ve worked with OCR (optical character recognition) tools, you know they’re great at turning images into text. But OCR only reads line-by-line, left-to-right, with no real context. It’s like scanning a page with a highlighter but having no clue what’s actually important.

Here’s where it gets tricky:

  1. No Consistent Layouts
    Unstructured documents like contracts and agreements don’t follow a template. Unlike a form or spreadsheet, where you know "Name" will always be in cell B2, an employee agreement might list the employee's name on page 1, page 4, or not label it at all. You can’t rely on simple “field matching” rules.
  2. Different Names for the Same Thing
    In an invoice, the "Customer Name" might be labeled "Bill To" or "Client" depending on who created the form. The "Invoice ID" might be preceded by a hash sign (#24210) instead of the words “Invoice ID.” Without human context, it’s hard for software to recognize these as the same thing.
  3. Multi-line Descriptions
    Item descriptions in invoices don’t always fit on one line. OCR reads one line at a time, but it doesn’t "know" that "Belkin Router Accessories" on one line is part of the same item as "Bluetooth Adapter" on the next. Traditional tools get tripped up by this.
  4. Context Matters
    Numbers on a page aren’t always obvious. If you see "105,000," is it a base salary, a bonus, or a payment ID? Without context, there’s no way to know. Human reviewers understand that "Annual Base Salary: $105,000" is referring to salary, but AI needs help.

These challenges are the reason why most data extraction systems fall back on rigid templates — and why they break the moment the document format changes.
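To make that concrete, here is a minimal Python sketch of a template-style rule. It is not Adlib code. The rule and the two sample invoices are made up for illustration, reusing the labels mentioned above. The rule finds the customer when the label reads "Customer Name" and finds nothing when the same data sits under "Bill To."

```python
import re

# A rigid, template-style rule: it only knows the exact label "Customer Name".
TEMPLATE_RULE = re.compile(r"^Customer Name:\s*(.+)$", re.MULTILINE)


def extract_customer_name(text):
    """Return the customer name if the exact label is present, else None."""
    match = TEMPLATE_RULE.search(text)
    return match.group(1).strip() if match else None


# Two illustrative invoices carrying the same data under different labels.
invoice_a = "Invoice ID: 24210\nCustomer Name: Neil Thomas\nTotal: $350.00"
invoice_b = "#24210\nBill To: Neil Thomas\nTotal: $350.00"

result_a = extract_customer_name(invoice_a)  # label matches the template
result_b = extract_customer_name(invoice_b)  # same customer, different label

print(result_a)
print(result_b)
```

You could keep adding patterns for every label variant, but that is exactly the maintenance burden that makes template-based extraction brittle.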

How Adlib Solves It (And Why It’s Different)

In Adlib Transform 2024.2, we’ve built a system that understands documents. By combining document transformation, LLMs, and prompt engineering, we’ve created a solution that can handle both structured and unstructured content.

Here’s how it works:

1. Document Preparation & OCR Conversion

The process starts by taking in a document (PDFs, scanned images, etc.) and using OCR to convert it into machine-readable text. But we go further than traditional OCR by adding logic and structure to that text, which primes it for extraction. This ensures the LLM can see the document the way a human would.
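As a rough illustration of what "adding structure" to OCR output can mean in general, the Python sketch below groups recognized words into lines using their positions on the page. This is a generic technique and an assumption on my part, not a description of how Adlib Transform works internally. The Word type, the coordinates, and the tolerance value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # top edge of the word on the page


def words_to_lines(words, y_tolerance=5.0):
    """Group words into lines (top to bottom), then order each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current line if its top edge is close to the line's first word.
        if lines and abs(word.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Hypothetical OCR output, deliberately out of order.
ocr_words = [
    Word("Thomas", 160, 101),
    Word("Bill", 10, 100),
    Word("To:", 50, 102),
    Word("Neil", 100, 100),
    Word("#24210", 10, 40),
]

text_lines = words_to_lines(ocr_words)
print(text_lines)
```

Text rebuilt this way keeps labels next to their values, which gives a downstream model more to work with than a flat stream of characters.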

2. Template-Free Extraction Logic

Here’s the part I’m most excited about. Instead of forcing users to create templates for every single document type, we let you define what you want to extract.

For example, if you’re working with invoices, you might tell the system to extract:

  • Customer Name (even if it’s called "Bill To" or "Client" on the form)
  • Invoice ID (even if it’s just a number after a "#" symbol, such as "#24210")
  • Product Descriptions (even if they span multiple lines)

If you’re working with an employee agreement, you might want to extract:

  • Employee Name
  • Employer Name
  • Base Salary
  • Annual Bonus Percentage

Instead of making you hunt for each field yourself, Adlib’s system knows how to find them, even if they’re phrased differently from one document to the next.
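One way to picture "define what you want to extract" is as a short list of field definitions rather than a layout template. The Python sketch below is only an illustration. FieldSpec, also_known_as, and empty_result are hypothetical names, not part of the Adlib interface. The field names come from the two lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str                  # key we want in the extracted output
    also_known_as: tuple = ()  # labels a document might use instead


INVOICE_FIELDS = [
    FieldSpec("Customer Name", ("Bill To", "Client")),
    FieldSpec("Invoice ID", ("#",)),
    FieldSpec("Product Descriptions"),
]

EMPLOYEE_AGREEMENT_FIELDS = [
    FieldSpec("Employee Name"),
    FieldSpec("Employer Name"),
    FieldSpec("Base Salary"),
    FieldSpec("Annual Bonus Percentage"),
]


def empty_result(fields):
    """Build the output skeleton that an extraction step would fill in."""
    return {spec.name: None for spec in fields}


print(empty_result(INVOICE_FIELDS))
print(empty_result(EMPLOYEE_AGREEMENT_FIELDS))
```

Nothing in these definitions says where on the page a value lives, which is the point. The same list applies to every invoice or agreement, whatever its layout.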

3. User-Friendly Prompt Engineering

This is one of my favorite parts. With Adlib’s interface, you don’t need to be an AI expert to guide the LLM. You can provide simple, plain-English instructions.

Here’s an example:

“Look for the customer address on the top-right of the first page.”

This plain-language approach means you don’t have to learn AI syntax or code prompts from scratch. You just tell the system what to look for, and it takes care of the rest.
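For readers curious about what happens to an instruction like that, the Python sketch below shows one plausible way field names and plain-English hints could be combined into a single LLM prompt. It is an assumption for illustration only. build_prompt and the prompt wording are hypothetical, and the actual prompt Adlib sends through AiLink is not described in this post.

```python
def build_prompt(document_text, fields, hints):
    """Combine field names, optional plain-English hints, and the document text."""
    lines = [
        "Extract the following fields from the document and reply with JSON only.",
        "",
        "Fields:",
    ]
    for name in fields:
        hint = hints.get(name)
        # Attach the user's hint in parentheses when one was provided.
        lines.append(f"- {name}" + (f" ({hint})" if hint else ""))
    lines += ["", "Document:", document_text]
    return "\n".join(lines)


# Sample text and hint taken from the examples in this post.
prompt = build_prompt(
    "#24210\nShip To: 123 Main Street, Toronto, ON",
    ["Customer Address", "Invoice ID"],
    {"Customer Address": "Look for the customer address on the top-right of the first page."},
)

print(prompt)
```

The hint travels with the field it describes, so a user can refine one field without touching the others.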

4. Contextual Awareness

The real magic happens here. Adlib doesn't just "read" — it "understands."

In one of our invoice examples, the system knew that the address under “Ship To” was the Customer Address, even though it wasn’t labeled that way. It also recognized that #24210 was an Invoice ID, even though it wasn’t called "Invoice ID" anywhere.

In an employment agreement, the system was able to recognize that the signatures at the bottom belonged to the Employer (Jim Davis) and the Employee (Neil Thomas) based on the context from earlier in the document. This contextual understanding is only possible because we use LLMs — and prime them properly using Adlib’s custom interface.

See the two examples below.

The End Result: Clear, Accurate, and Usable Data

Once the document has been processed, the extracted information is output as a JSON file or another machine-readable format that can be fed directly into your downstream business systems. No manual reviews. No rework. Just clean, actionable data.

For example, in the case of an invoice, you get a file that looks something like this:

{
  "Customer Name": "Neil Thomas",
  "Customer Address": "123 Main Street, Toronto, ON",
  "Invoice ID": "#24210",
  "Items": [
    "Belkin Router Accessories",
    "Bluetooth Adapter"
  ],
  "Total Amount": "$350.00"
}

For employee agreements, you might extract:

{
  "Employee Name": "Neil Thomas",
  "Employer Name": "Rockwell Equity Trust",
  "Base Salary": "$105,000 per year",
  "Annual Bonus Percentage": "18%"
}

With this data, you can integrate directly into financial systems, HR platforms, and other business workflows.
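As a sketch of that hand-off, the Python below loads the employee agreement output shown above, checks that the expected fields are present, and converts the salary and bonus strings into numbers a downstream system could store. The helper names (parse_money, parse_percent, REQUIRED_FIELDS) are hypothetical, and the parsing rules assume values formatted like the ones in this example.

```python
import json
import re
from decimal import Decimal

# The employee agreement output from the example above.
raw_output = """
{
  "Employee Name": "Neil Thomas",
  "Employer Name": "Rockwell Equity Trust",
  "Base Salary": "$105,000 per year",
  "Annual Bonus Percentage": "18%"
}
"""

REQUIRED_FIELDS = ["Employee Name", "Employer Name", "Base Salary", "Annual Bonus Percentage"]


def parse_money(value):
    """Pull the dollar amount out of a string such as '$105,000 per year'."""
    match = re.search(r"\$([\d,]+(?:\.\d+)?)", value)
    if match is None:
        raise ValueError(f"no dollar amount found in {value!r}")
    return Decimal(match.group(1).replace(",", ""))


def parse_percent(value):
    """Convert a string such as '18%' into a fraction (0.18)."""
    return Decimal(value.strip().rstrip("%")) / 100


record = json.loads(raw_output)
missing = [name for name in REQUIRED_FIELDS if name not in record]

base_salary = parse_money(record["Base Salary"])
bonus_rate = parse_percent(record["Annual Bonus Percentage"])

print(missing, base_salary, bonus_rate)
```

A check like the missing list is a cheap guard to run before loading records into a financial or HR system.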

What Makes Adlib Different?

Unlike traditional extraction tools, Adlib’s system is:

  • Flexible: No rigid templates that break with every layout change.
  • Context-Aware: LLMs understand concepts like "Base Salary" and "Ship To."
  • User-Friendly: Plain-English prompts make it easy for non-technical users to guide AI.
  • Fast: Process large batches of documents in seconds.

This means no more manual keying, no more rework, and no more frustration.

Final Thoughts

If you'd like to see how it works on your own documents, I’d love to show you. Reach out to your Adlib rep, and we can walk you through it.

Thanks for reading,

Kunal Bargotra, Product Manager, Adlib
