Extract Structured Data From PDFs With AI SDK
Build a PDF invoice parser that extracts text server-side and uses LLM structured outputs to convert unstructured invoice data into typed JSON objects
Patrick the AI Engineer
You need to process invoices from different vendors. Each one formats their data differently, and you need to extract invoice numbers, dates, line items, and totals into a consistent structure. We're going to build a system that parses PDF text on the server and uses the AI SDK to extract structured invoice data validated against a Zod schema.
The interesting part is combining server-side PDF parsing with client-side LLM structured outputs. The PDF stays in the browser for preview, the server extracts text without library bloat in the client bundle, and the LLM converts messy invoice text into clean typed objects.
Parsing PDFs on the Server
We'll start with a Nuxt 4 server route that accepts PDF uploads and extracts text. Install pdf-parse first:
npm install pdf-parse
Here's the basic setup:
// server/api/pdf/parse.post.ts
import { defineEventHandler, readMultipartFormData } from 'h3'
import { PDFParse } from 'pdf-parse'
export default defineEventHandler(async (event) => {
const form = await readMultipartFormData(event)
const file = form?.find(f => f.name === 'file' && 'data' in f)
const buffer = file.data as Buffer
const parser = new PDFParse({ data: buffer })
const result = await parser.getText()
return { text: result.text }
})
This reads the uploaded file from the multipart form data and passes the buffer to PDFParse. The getText() method returns an object with a text property containing all the extracted text from the PDF.
We need to clean up the parser when we're done:
export default defineEventHandler(async (event) => {
const form = await readMultipartFormData(event)
const file = form?.find(f => f.name === 'file' && 'data' in f)
const buffer = file.data as Buffer
const parser = new PDFParse({ data: buffer })
const result = await parser.getText()
await parser.destroy()
return { text: result.text }
})
The parser holds memory and file handles, so we call destroy() to release them. This matters when processing many PDFs.
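Note that calling destroy() only on the success path means the handles leak if getText() throws. A try/finally guarantees cleanup either way. Here's the pattern sketched with a stand-in parser (FakeParser is a test double I'm inventing to make the behavior visible, not part of pdf-parse):

```typescript
// Sketch of the cleanup pattern: destroy() must run even when getText() throws.
// FakeParser stands in for PDFParse so the pattern can be shown in isolation.
class FakeParser {
  destroyed = false
  async getText(): Promise<{ text: string }> {
    throw new Error('corrupt PDF')
  }
  async destroy(): Promise<void> {
    this.destroyed = true
  }
}

async function parseWithCleanup(parser: FakeParser): Promise<string> {
  try {
    const result = await parser.getText()
    return result.text
  } finally {
    await parser.destroy() // released on both success and failure
  }
}
```

The same try/finally shape drops straight into the server route around the getText() call.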
Let's add page count and error handling:
export default defineEventHandler(async (event) => {
try {
const form = await readMultipartFormData(event)
if (!form || form.length === 0) {
return { error: 'No form data received' }
}
const file = form.find(f => f.name === 'file' && 'data' in f)
if (!file || !('data' in file)) {
throw new Error('No file uploaded')
}
const buffer = file.data as Buffer
const parser = new PDFParse({ data: buffer })
const result = await parser.getText()
await parser.destroy()
return { text: result.text || '', numPages: result.pages?.length || 0 }
} catch (err) {
throw createError({
statusCode: 500,
statusMessage: 'Failed to parse PDF',
data: err instanceof Error ? err.message : 'Unknown error'
})
}
})
The pages property is an array, so we use its length to get the page count. Using Nuxt's createError returns proper HTTP error responses to the client.
Uploading and Extracting Text
Now we'll build the client that uploads PDFs and displays the extracted text. Start with file upload handling:
// client component
const extractedText = ref<string>('')
const extracting = ref(false)
async function extractPdfText(file: File) {
extracting.value = true
extractedText.value = ''
const form = new FormData()
form.append('file', file)
const res = await fetch('/api/pdf/parse', { method: 'POST', body: form })
const json = await res.json()
extractedText.value = json.text || ''
extracting.value = false
}
This creates a FormData object, appends the PDF file, and posts it to our server route. The response contains the extracted text that we store in a ref.
Add error handling for the network request:
async function extractPdfText(file: File) {
try {
extracting.value = true
extractedText.value = ''
const form = new FormData()
form.append('file', file)
const res = await fetch('/api/pdf/parse', { method: 'POST', body: form })
const json = await res.json()
if (!res.ok || json.error) {
console.error('Failed to extract text')
return
}
extractedText.value = json.text || ''
} catch {
console.error('Failed to extract text')
} finally {
extracting.value = false
}
}
The finally block ensures we reset the loading state even if something fails. Checking res.ok catches HTTP errors from the server.
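It's also worth rejecting obviously wrong files client-side before spending a network round trip. A minimal guard might look like this (the 10 MB cap and the function name are my own choices, not from any library):

```typescript
// Hypothetical client-side guard: reject non-PDFs and oversized files
// before uploading. The size cap is an arbitrary choice.
const MAX_PDF_BYTES = 10 * 1024 * 1024 // 10 MB

function validatePdfFile(file: { type: string; size: number }): string | null {
  if (file.type !== 'application/pdf') return 'Only PDF files are supported'
  if (file.size > MAX_PDF_BYTES) return 'File is too large (max 10 MB)'
  return null // null means the file looks acceptable
}
```

Call it at the top of extractPdfText and bail out (or surface the message to the user) when it returns a string.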
Defining the Invoice Schema
Before we can extract structured data, we need to define what structure we want. Here's a Zod schema for invoices:
import { z } from 'zod'
const invoiceSchema = z.object({
invoiceNumber: z.string().min(1),
invoiceDate: z.string().min(1),
vendorName: z.string().min(1),
total: z.number()
})
type InvoiceData = z.infer<typeof invoiceSchema>
This defines the minimal fields every invoice should have. The z.infer utility generates a TypeScript type from the schema, so we get validation and types from a single source.
Let's add optional fields and descriptions:
const invoiceSchema = z.object({
invoiceNumber: z.string().min(1).describe('Invoice identifier number'),
invoiceDate: z.string().min(1).describe('Invoice date'),
dueDate: z.string().optional().describe('Payment due date'),
vendorName: z.string().min(1).describe('Vendor/seller name'),
vendorAddress: z.string().optional().describe('Vendor address'),
customerName: z.string().optional().describe('Customer/buyer name'),
subtotal: z.number().optional().describe('Subtotal before tax'),
tax: z.number().optional().describe('Tax amount'),
total: z.number().describe('Total amount')
})
The .describe() method adds field documentation that the LLM can see. This helps the model understand what to extract. Making fields optional with .optional() means the LLM won't fail if a field is missing from the invoice.
Add line items as a nested array:
const invoiceSchema = z.object({
invoiceNumber: z.string().min(1).describe('Invoice identifier number'),
invoiceDate: z.string().min(1).describe('Invoice date'),
dueDate: z.string().optional().describe('Payment due date'),
vendorName: z.string().min(1).describe('Vendor/seller name'),
vendorAddress: z.string().optional().describe('Vendor address'),
customerName: z.string().optional().describe('Customer/buyer name'),
subtotal: z.number().optional().describe('Subtotal before tax'),
tax: z.number().optional().describe('Tax amount'),
total: z.number().describe('Total amount'),
lineItems: z.array(z.object({
description: z.string().describe('Item description'),
quantity: z.number().optional().describe('Quantity'),
unitPrice: z.number().optional().describe('Unit price'),
amount: z.number().optional().describe('Line item total')
})).optional().describe('Invoice line items')
})
The nested schema defines the structure for each line item. Making the array itself optional handles invoices that don't include line item details.
Extracting Structured Data
Now we use the AI SDK to convert the extracted text into structured JSON. Install the AI SDK and OpenAI provider:
npm install ai @ai-sdk/openai zod
Here's the basic extraction:
import { generateObject } from 'ai'
import { createOpenAI } from '@ai-sdk/openai'
async function analyzeInvoice(text: string, apiKey: string) {
const openai = createOpenAI({ apiKey })
const { object } = await generateObject({
model: openai('gpt-4o-mini'),
schema: invoiceSchema,
prompt: `Extract invoice data: ${text}`
})
return object
}
The AI SDK's generateObject takes a schema and returns a typed object that matches it. The model sees the schema as JSON Schema and structures its output accordingly. The returned object is automatically typed as InvoiceData thanks to Zod's type inference.
Let's clean up the text before sending it:
async function analyzeInvoice(text: string, apiKey: string) {
const openai = createOpenAI({ apiKey })
const cleanedText = text
.replace(/\\n/g, '\n')
.replace(/\n+/g, ' ')
.replace(/\s+/g, ' ')
.trim()
const { object } = await generateObject({
model: openai('gpt-4o-mini'),
schema: invoiceSchema,
prompt: `Extract invoice data: ${cleanedText}`
})
return object
}
PDF parsers often include escaped newlines and excessive whitespace. Converting \n to actual newlines, then collapsing multiple newlines and spaces makes the text cleaner for the LLM. This reduces token count and makes extraction more accurate.
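To make the effect concrete, here's the same chain wrapped in a helper and run on a fabricated parser output:

```typescript
// The cleaning chain from above, as a standalone helper.
function cleanInvoiceText(text: string): string {
  return text
    .replace(/\\n/g, '\n')  // escaped "\n" sequences -> real newlines
    .replace(/\n+/g, ' ')   // runs of newlines -> single space
    .replace(/\s+/g, ' ')   // any whitespace run -> single space
    .trim()
}

// Fabricated sample of messy parser output.
const raw = 'Invoice #42\\nTotal:   $10.00\n\n  Thank you '
console.log(cleanInvoiceText(raw)) // "Invoice #42 Total: $10.00 Thank you"
```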
Add a system message and structured prompt:
async function analyzeInvoice(text: string, apiKey: string) {
const openai = createOpenAI({ apiKey })
const cleanedText = text
.replace(/\\n/g, '\n')
.replace(/\n+/g, ' ')
.replace(/\s+/g, ' ')
.trim()
const { object } = await generateObject({
model: openai('gpt-4o-mini'),
schema: invoiceSchema,
system: 'You are a precise invoice data extraction assistant. Extract all available invoice information accurately. If a field is not present in the document, omit it.',
prompt: `Extract structured invoice data from the following invoice text: <invoice>${cleanedText}</invoice>`,
temperature: 0
})
return object
}
The system message tells the model what its job is. Wrapping the text in <invoice> tags helps the model identify the content boundary. Setting temperature: 0 makes the model deterministic, giving consistent results for the same invoice.
Wiring It Together
Let's connect the upload, extraction, and analysis. Here's the Vue component structure:
<script setup lang="ts">
import { ref } from 'vue'
import { generateObject } from 'ai'
import { createOpenAI } from '@ai-sdk/openai'
import { z } from 'zod'
const extractedText = ref<string>('')
const analysis = ref<InvoiceData | null>(null)
const analyzing = ref(false)
async function handleFileUpload(e: Event) {
const file = (e.target as HTMLInputElement).files?.[0]
if (!file) return
await extractPdfText(file)
}
async function analyzeInvoice() {
analyzing.value = true
const openai = createOpenAI({ apiKey: 'your-key' })
const cleanedText = extractedText.value
.replace(/\\n/g, '\n')
.replace(/\n+/g, ' ')
.replace(/\s+/g, ' ')
.trim()
const { object } = await generateObject({
model: openai('gpt-4o-mini'),
schema: invoiceSchema,
system: 'You are a precise invoice data extraction assistant.',
prompt: `Extract structured invoice data: <invoice>${cleanedText}</invoice>`,
temperature: 0
})
analysis.value = object
analyzing.value = false
}
</script>
This connects the file upload to text extraction, then provides an analyzeInvoice function that the user triggers after reviewing the extracted text. Keeping these separate lets users verify the text extraction worked before paying for an LLM call.
Add the template with two-column layout:
<template>
<div>
<input
type="file"
accept="application/pdf"
@change="handleFileUpload"
>
<div v-if="extractedText">
<h4>Extracted Text</h4>
<pre>{{ extractedText }}</pre>
<button @click="analyzeInvoice" :disabled="analyzing">
{{ analyzing ? 'Analyzing...' : 'Analyze Invoice' }}
</button>
</div>
<div v-if="analysis">
<h4>Extracted Invoice Data</h4>
<pre>{{ JSON.stringify(analysis, null, 2) }}</pre>
</div>
</div>
</template>
The button is disabled while analyzing, and we show the loading state in the button text. The analysis result displays as formatted JSON below the extracted text.
Error Handling and Edge Cases
Add error handling for the LLM call:
async function analyzeInvoice() {
if (!extractedText.value) {
console.error('No text to analyze')
return
}
analyzing.value = true
try {
const openai = createOpenAI({ apiKey: 'your-key' })
const cleanedText = extractedText.value
.replace(/\\n/g, '\n')
.replace(/\n+/g, ' ')
.replace(/\s+/g, ' ')
.trim()
const { object } = await generateObject({
model: openai('gpt-4o-mini'),
schema: invoiceSchema,
system: 'You are a precise invoice data extraction assistant.',
prompt: `Extract structured invoice data: <invoice>${cleanedText}</invoice>`,
temperature: 0
})
analysis.value = object
} catch (err) {
console.error('Failed to analyze invoice:', err)
analysis.value = null
} finally {
analyzing.value = false
}
}
The AI SDK throws errors for network failures, invalid schemas, or when the model can't generate valid output. Catching these and resetting the analysis state prevents the UI from showing stale data.
Limit the text length to avoid token limits:
const cleanedText = extractedText.value
.replace(/\\n/g, '\n')
.replace(/\n+/g, ' ')
.replace(/\s+/g, ' ')
.trim()
.slice(0, 120000)
At roughly 4 characters per token, 120,000 characters is about 30,000 tokens. This leaves room in the context window for the schema and response while handling very long invoices.
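The arithmetic behind that cap, as a tiny helper (the 4-characters-per-token ratio is a rough rule of thumb for English text, not an exact tokenizer count):

```typescript
// Back-of-envelope token budgeting using the ~4 chars/token heuristic.
const CHARS_PER_TOKEN = 4

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN)
}

// A 120,000-character invoice comes out to about 30,000 tokens.
console.log(estimateTokens('x'.repeat(120_000))) // 30000
```

For a precise count you'd run the model's actual tokenizer, but this estimate is enough to pick a safe slice length.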
Displaying the PDF Alongside Results
Instead of just showing text, display the actual PDF next to the analysis:
<template>
<div class="layout">
<div class="pdf-column">
<object
v-if="pdfUrl"
:data="pdfUrl"
type="application/pdf"
>
PDF not supported
</object>
</div>
<div class="data-column">
<div v-if="extractedText">
<h4>Extracted Text</h4>
<pre>{{ extractedText }}</pre>
<button @click="analyzeInvoice">Analyze</button>
</div>
<div v-if="analysis">
<h4>Invoice Data</h4>
<pre>{{ JSON.stringify(analysis, null, 2) }}</pre>
</div>
</div>
</div>
</template>
<style scoped>
.layout {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 1rem;
}
.pdf-column object {
width: 100%;
height: 600px;
}
.data-column {
overflow-y: auto;
max-height: 800px;
}
</style>
The <object> element embeds the PDF directly in the page. Create a blob URL from the uploaded file:
const pdfUrl = ref<string | null>(null)
async function handleFileUpload(e: Event) {
const file = (e.target as HTMLInputElement).files?.[0]
if (!file) return
if (pdfUrl.value) { URL.revokeObjectURL(pdfUrl.value) }
pdfUrl.value = URL.createObjectURL(file)
await extractPdfText(file)
}
createObjectURL generates a temporary URL pointing to the file in memory. Revoking the old URL before creating a new one prevents memory leaks when uploading multiple PDFs.
This gives you the PDF on the left and results on the right. You can verify the extraction by comparing the raw text and structured data against what you see in the PDF.
Using a Composable for API Keys
Instead of hardcoding API keys, use a composable to manage them:
// composables/useApiKeys.ts
const openaiKey = ref<string | null>(null)
export function useApiKeys() {
return {
openaiKey,
setKey: (key: string) => {
openaiKey.value = key
}
}
}
Then in your component:
const { openaiKey } = useApiKeys()
async function analyzeInvoice() {
if (!openaiKey.value) {
console.error('API key not set')
return
}
const openai = createOpenAI({ apiKey: openaiKey.value })
// rest of the function...
}
This lets you build a settings UI where users enter their own API keys. The keys never leave the browser and aren't stored on your server.
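If you want the key to survive a page reload, one approach is to persist it through an injected Storage-like object (the 'openai-api-key' name and function names here are my own; in the browser you'd pass window.localStorage, ideally with encryption layered on top):

```typescript
// Hedged sketch: key persistence behind a minimal Storage-like interface,
// so the logic works the same with localStorage or any other backing store.
interface KeyStorage {
  getItem(key: string): string | null
  setItem(key: string, value: string): void
}

const STORAGE_KEY = 'openai-api-key'

function saveApiKey(storage: KeyStorage, key: string): void {
  storage.setItem(STORAGE_KEY, key)
}

function loadApiKey(storage: KeyStorage): string | null {
  return storage.getItem(STORAGE_KEY)
}
```

In the composable, call loadApiKey on startup to seed the ref and saveApiKey inside setKey.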
The AI SDK's generateObject handles schema validation automatically. If the model returns invalid JSON or data that doesn't match the schema, the SDK throws an error. You get type-safe, validated objects without writing validation code.
For production use, you'd want to store API keys encrypted in localStorage, add token usage tracking, and handle multi-page PDFs differently depending on whether you need to analyze each page separately or as one document. The basic pattern works for invoices, receipts, contracts, or any document with structured data buried in unstructured text.
<script setup lang="ts">
import { ref, onBeforeUnmount } from 'vue'
import { generateObject } from 'ai'
import { createOpenAI } from '@ai-sdk/openai'
import { z } from 'zod'
import { useApiKeys } from '~/composables/useApiKeys'
const fileUploadRef = ref<HTMLInputElement>()
const pdfUrl = ref<string | null>(null)
const extractedText = ref<string>('')
const extracting = ref(false)
const analyzing = ref(false)
const analysis = ref<InvoiceData | null>(null)
const invoiceSchema = z.object({
invoiceNumber: z.string().min(1).describe('Invoice identifier number'),
invoiceDate: z.string().min(1).describe('Invoice date'),
dueDate: z.string().optional().describe('Payment due date'),
vendorName: z.string().min(1).describe('Vendor/seller name'),
vendorAddress: z.string().optional().describe('Vendor address'),
customerName: z.string().optional().describe('Customer/buyer name'),
customerAddress: z.string().optional().describe('Customer address'),
currency: z.string().optional().describe('Currency code'),
subtotal: z.number().optional().describe('Subtotal amount before tax'),
tax: z.number().optional().describe('Tax amount'),
total: z.number().describe('Total amount'),
purchaseOrderNumber: z.string().optional().describe('Purchase order number'),
paymentTerms: z.string().optional().describe('Payment terms'),
lineItems: z.array(z.object({
description: z.string().describe('Item description'),
quantity: z.number().optional().describe('Quantity'),
unitPrice: z.number().optional().describe('Unit price'),
amount: z.number().optional().describe('Line item total')
})).optional().describe('Invoice line items')
})
type InvoiceData = z.infer<typeof invoiceSchema>
const { openaiKey } = useApiKeys()
async function handleFileUpload(e: Event) {
const target = e.target as HTMLInputElement
const file = target.files?.[0]
if (!file) return
if (pdfUrl.value) {
URL.revokeObjectURL(pdfUrl.value)
}
pdfUrl.value = URL.createObjectURL(file)
await extractPdfText(file)
}
async function extractPdfText(file: File) {
try {
extracting.value = true
extractedText.value = ''
const form = new FormData()
form.append('file', file)
const res = await fetch('/api/pdf/parse', { method: 'POST', body: form })
const json = await res.json()
if (!res.ok || json.error) {
console.error('Failed to extract text from PDF')
return
}
extractedText.value = json.text || ''
} catch {
console.error('Failed to extract text from PDF')
} finally {
extracting.value = false
}
}
async function analyzeInvoice() {
if (!extractedText.value || !openaiKey.value) return
analyzing.value = true
try {
const openai = createOpenAI({ apiKey: openaiKey.value })
const cleanedText = extractedText.value
.replace(/\\n/g, '\n')
.replace(/\n+/g, ' ')
.replace(/\s+/g, ' ')
.trim()
.slice(0, 120000)
const { object } = await generateObject({
model: openai('gpt-4o-mini'),
schema: invoiceSchema,
system: 'You are a precise invoice data extraction assistant. Extract all available invoice information accurately. If a field is not present in the document, omit it.',
prompt: `Extract structured invoice data from the following invoice text: <invoice>${cleanedText}</invoice>`,
temperature: 0
})
analysis.value = object
} catch (err) {
console.error('Failed to analyze invoice:', err)
analysis.value = null
} finally {
analyzing.value = false
}
}
onBeforeUnmount(() => {
if (pdfUrl.value) {
URL.revokeObjectURL(pdfUrl.value)
}
})
</script>
<template>
<div>
<input
ref="fileUploadRef"
type="file"
accept="application/pdf"
@change="handleFileUpload"
>
<div v-if="pdfUrl" class="layout">
<div class="pdf-column">
<object
:data="pdfUrl"
type="application/pdf"
>
PDF preview not supported in this browser.
</object>
</div>
<div class="data-column">
<div v-if="extracting">
Extracting text…
</div>
<div v-if="extractedText">
<h4>Extracted Text</h4>
<pre>{{ extractedText }}</pre>
<button
:disabled="analyzing || !openaiKey"
@click="analyzeInvoice"
>
{{ analyzing ? 'Analyzing...' : 'Analyze Invoice' }}
</button>
</div>
<div v-if="analysis">
<h4>AI Extracted Invoice</h4>
<pre>{{ JSON.stringify(analysis, null, 2) }}</pre>
</div>
</div>
</div>
</div>
</template>
<style scoped>
.layout {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 1rem;
margin-top: 1rem;
}
.pdf-column object {
width: 100%;
height: 600px;
border: 2px solid #e5e7eb;
border-radius: 0.5rem;
}
.data-column {
overflow-y: auto;
max-height: 800px;
}
pre {
white-space: pre-wrap;
word-wrap: break-word;
max-height: 300px;
overflow: auto;
padding: 0.75rem;
background: #f9fafb;
border: 1px solid #e5e7eb;
border-radius: 0.375rem;
font-size: 13px;
}
button {
margin-top: 0.75rem;
padding: 10px 16px;
background: #2563eb;
color: white;
border: none;
border-radius: 6px;
cursor: pointer;
}
button:disabled {
background: #9ca3af;
cursor: not-allowed;
}
</style>
import { generateObject } from 'ai'
import { createOpenAI } from '@ai-sdk/openai'
import { z } from 'zod'
import { PDFParse } from 'pdf-parse'
import { readFile } from 'node:fs/promises'
const invoiceSchema = z.object({
invoiceNumber: z.string().min(1).describe('Invoice identifier number'),
invoiceDate: z.string().min(1).describe('Invoice date'),
dueDate: z.string().optional().describe('Payment due date'),
vendorName: z.string().min(1).describe('Vendor/seller name'),
vendorAddress: z.string().optional().describe('Vendor address'),
customerName: z.string().optional().describe('Customer/buyer name'),
customerAddress: z.string().optional().describe('Customer address'),
currency: z.string().optional().describe('Currency code'),
subtotal: z.number().optional().describe('Subtotal amount before tax'),
tax: z.number().optional().describe('Tax amount'),
total: z.number().describe('Total amount'),
purchaseOrderNumber: z.string().optional().describe('Purchase order number'),
paymentTerms: z.string().optional().describe('Payment terms'),
lineItems: z.array(z.object({
description: z.string().describe('Item description'),
quantity: z.number().optional().describe('Quantity'),
unitPrice: z.number().optional().describe('Unit price'),
amount: z.number().optional().describe('Line item total')
})).optional().describe('Invoice line items')
})
type InvoiceData = z.infer<typeof invoiceSchema>
async function extractTextFromPdf(filePath: string): Promise<string> {
const buffer = await readFile(filePath)
const parser = new PDFParse({ data: buffer })
const result = await parser.getText()
await parser.destroy()
return result.text || ''
}
async function extractInvoiceData(
text: string,
apiKey: string
): Promise<InvoiceData> {
const openai = createOpenAI({ apiKey })
const cleanedText = text
.replace(/\\n/g, '\n')
.replace(/\n+/g, ' ')
.replace(/\s+/g, ' ')
.trim()
.slice(0, 120000)
const { object } = await generateObject({
model: openai('gpt-4o-mini'),
schema: invoiceSchema,
system: 'You are a precise invoice data extraction assistant. Extract all available invoice information accurately. If a field is not present in the document, omit it.',
prompt: `Extract structured invoice data from the following invoice text: <invoice>${cleanedText}</invoice>`,
temperature: 0
})
return object
}
// Usage
async function processInvoice(pdfPath: string, apiKey: string) {
const text = await extractTextFromPdf(pdfPath)
const invoice = await extractInvoiceData(text, apiKey)
console.log('Extracted invoice:', invoice)
return invoice
}
// Run it
processInvoice('./invoice.pdf', process.env.OPENAI_API_KEY!)
// server/api/pdf/parse.post.ts
import { defineEventHandler, readMultipartFormData } from 'h3'
import { PDFParse } from 'pdf-parse'
export default defineEventHandler(async (event) => {
try {
const form = await readMultipartFormData(event)
if (!form || form.length === 0) {
return { error: 'No form data received' }
}
const file = form.find(f => f.name === 'file' && 'data' in f)
if (!file || !('data' in file)) {
throw new Error('No file uploaded (expecting field name "file")')
}
const buffer = file.data as Buffer
const parser = new PDFParse({ data: buffer })
const textResult = await parser.getText()
await parser.destroy()
const text: string = textResult.text || ''
const numPages: number = textResult.pages?.length || 0
return { text, numPages }
} catch (err) {
throw createError({
statusCode: 500,
statusMessage: 'Failed to parse PDF',
data: err instanceof Error ? err.message : 'Unknown error'
})
}
})
