Gemini Multimodal Document Agent Tutorial: Build and Deploy on Vultr for AI Hackathons

Build a Multimodal Document Intelligence Agent with Gemini ADK and Deploy It on Vultr
Enterprise teams spend thousands of hours manually extracting data from invoices, contracts, and scanned documents. Gemini 2.5 Flash can read any of those documents natively, and Google's Agent Development Kit (ADK) gives you a clean framework to turn that capability into a production service. This tutorial walks you through building a document extraction agent, wrapping it in a FastAPI service, containerizing it, and shipping it to a live Vultr server.
By the end you will have a public API endpoint that accepts a PDF, image, or text file and returns structured JSON with every relevant field pulled out of the document.
This kind of document intelligence agent is particularly valuable for AI hackathon 2026 teams, where shipping a working prototype in 48 hours is the whole game. Whether you are building a fintech tool, a legal document processor, or a contract analyzer, this stack gives you a production-ready foundation within a single sprint. Browse upcoming AI hackathons on LabLab.ai to see where you can apply it next.
What You'll Build
A containerized FastAPI service backed by a Google ADK agent that:
- Accepts file uploads (PDF, image, plain text)
- Identifies the document type automatically
- Calls the appropriate extraction tool (invoice, contract, or general)
- Returns clean structured JSON
Stack: Python 3.11, Google ADK 1.18, Gemini 2.5 Flash, FastAPI, Docker, Vultr Cloud Compute.
Prerequisites
- A Google AI Studio account and API key
- A Vultr account with a billing method added
- Docker installed locally
- Python 3.10 or higher
- Basic familiarity with FastAPI and async Python
Step 1: Get Your Gemini API Key
Go to Google AI Studio, sign in, and click Get API key. Create a new project if prompted, then copy the key. Keep it in a safe place as you will need it for both local development and the Vultr deployment.
Step 2: Set Up the Project
Create the project directory and install dependencies:
mkdir gemini-multimodal-document-agent
cd gemini-multimodal-document-agent
python3.10 -m venv .venv
source .venv/bin/activate
Create requirements.txt:
google-adk==1.18.0
fastapi>=0.111.0
uvicorn[standard]>=0.29.0
python-multipart>=0.0.9
pydantic>=2.7.0
python-dotenv>=1.0.0
Install:
pip install -r requirements.txt
Create a .env file:
GOOGLE_API_KEY=your_api_key_here
Add .env and .venv/ to .gitignore so they never get committed.
Your final project structure will be:
gemini-multimodal-document-agent/
├── app/
│   ├── __init__.py
│   ├── agent.py
│   ├── tools.py
│   ├── schemas.py
│   └── main.py
├── sample_docs/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── .env
Step 3: Define the Response Schemas
Create app/schemas.py:
from pydantic import BaseModel
from typing import Optional, Any
class AnalysisResponse(BaseModel):
    document_type: str
    filename: str
    extracted_data: dict[str, Any]
    summary: str
    processing_notes: Optional[str] = None
View app/schemas.py on GitHub →
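As a quick sanity check, the model can be exercised on its own. This standalone snippet duplicates the schema above so it runs without the project layout:

```python
from typing import Any, Optional

from pydantic import BaseModel


class AnalysisResponse(BaseModel):
    document_type: str
    filename: str
    extracted_data: dict[str, Any]
    summary: str
    processing_notes: Optional[str] = None


# Construct a response the way the API endpoint will.
resp = AnalysisResponse(
    document_type="invoice",
    filename="inv.pdf",
    extracted_data={"total_amount": "$6,032.00"},
    summary="Invoice from Acme.",
)
print(resp.model_dump())  # processing_notes defaults to None
```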
Step 4: Build the Extraction Tools
The ADK agent uses function-calling tools to return structured data. Each tool corresponds to a document type. When the agent reads a document, it decides which tool to call and passes every extracted field as arguments. The tool writes those arguments into the session state, which we read back after the agent finishes.
Create app/tools.py:
from typing import Optional
from google.adk.tools.tool_context import ToolContext
def save_invoice_extraction(
    tool_context: ToolContext,
    vendor_name: Optional[str] = None,
    invoice_number: Optional[str] = None,
    invoice_date: Optional[str] = None,
    due_date: Optional[str] = None,
    total_amount: Optional[str] = None,
    currency: Optional[str] = None,
    subtotal: Optional[str] = None,
    tax_amount: Optional[str] = None,
    line_items: Optional[list[str]] = None,
    payment_terms: Optional[str] = None,
    billing_address: Optional[str] = None,
    notes: Optional[str] = None,
) -> str:
    """Save structured data extracted from an invoice document.

    Call this tool when the document is an invoice or bill.

    Args:
        vendor_name: Name of the vendor or supplier issuing the invoice.
        invoice_number: Unique invoice identifier or number.
        invoice_date: Date the invoice was issued (ISO format preferred).
        due_date: Payment due date (ISO format preferred).
        total_amount: Total amount due including taxes.
        currency: Currency code (e.g. USD, EUR, KES).
        subtotal: Subtotal before taxes.
        tax_amount: Tax amount applied.
        line_items: Each item as a string: "Description | Qty | Unit Price | Total".
        payment_terms: Payment terms such as Net 30, Due on receipt, etc.
        billing_address: Billing address of the recipient.
        notes: Any additional notes or payment instructions on the invoice.

    Returns:
        Confirmation string.
    """
    tool_context.state["extraction_result"] = {
        "document_type": "invoice",
        "extracted_data": {
            "vendor_name": vendor_name,
            "invoice_number": invoice_number,
            "invoice_date": invoice_date,
            "due_date": due_date,
            "total_amount": total_amount,
            "currency": currency,
            "subtotal": subtotal,
            "tax_amount": tax_amount,
            "line_items": line_items or [],
            "payment_terms": payment_terms,
            "billing_address": billing_address,
            "notes": notes,
        },
    }
    return "Invoice extraction saved."

def save_contract_extraction(
    tool_context: ToolContext,
    parties: Optional[list[str]] = None,
    effective_date: Optional[str] = None,
    expiration_date: Optional[str] = None,
    contract_type: Optional[str] = None,
    key_obligations: Optional[list[str]] = None,
    termination_conditions: Optional[list[str]] = None,
    governing_law: Optional[str] = None,
    jurisdiction: Optional[str] = None,
) -> str:
    """Save structured data extracted from a contract or legal agreement.

    Call this tool when the document is a contract, agreement, MOU, NDA, or similar legal document.

    Args:
        parties: List of party names involved in the contract.
        effective_date: Date the contract takes effect (ISO format preferred).
        expiration_date: Date the contract expires (ISO format preferred).
        contract_type: Type of contract (e.g. NDA, Service Agreement, Employment Contract).
        key_obligations: List of key obligations or responsibilities for each party.
        termination_conditions: List of conditions under which the contract can be terminated.
        governing_law: The law or jurisdiction governing the contract.
        jurisdiction: The jurisdiction for resolving disputes.

    Returns:
        Confirmation string.
    """
    tool_context.state["extraction_result"] = {
        "document_type": "contract",
        "extracted_data": {
            "parties": parties or [],
            "effective_date": effective_date,
            "expiration_date": expiration_date,
            "contract_type": contract_type,
            "key_obligations": key_obligations or [],
            "termination_conditions": termination_conditions or [],
            "governing_law": governing_law,
            "jurisdiction": jurisdiction,
        },
    }
    return "Contract extraction saved."

def save_general_extraction(
    tool_context: ToolContext,
    document_title: Optional[str] = None,
    summary: Optional[str] = None,
    key_entities: Optional[list[str]] = None,
    dates_mentioned: Optional[list[str]] = None,
    key_figures: Optional[list[str]] = None,
    main_topics: Optional[list[str]] = None,
) -> str:
    """Save structured data extracted from a general document, report, image, or plain text.

    Call this tool for documents that are not invoices or contracts.

    Args:
        document_title: Title or inferred title of the document.
        summary: 2-3 sentence summary of the document content.
        key_entities: List of important people, organizations, or products mentioned.
        dates_mentioned: List of any dates referenced in the document.
        key_figures: List of important numbers, amounts, or statistics found.
        main_topics: List of main topics or themes covered.

    Returns:
        Confirmation string.
    """
    tool_context.state["extraction_result"] = {
        "document_type": "general",
        "extracted_data": {
            "document_title": document_title,
            "summary": summary,
            "key_entities": key_entities or [],
            "dates_mentioned": dates_mentioned or [],
            "key_figures": key_figures or [],
            "main_topics": main_topics or [],
        },
    }
    return "General extraction saved."
Three things to note about this design:
- All list parameters are typed as list[str]. The Gemini API requires typed list parameters in tool schemas and will reject untyped list annotations.
- ToolContext is injected automatically by ADK. You do not instantiate it yourself.
- The tool writes to tool_context.state, a session-scoped dictionary we read after the agent finishes running.
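The state-passing contract is easy to see in isolation. This sketch uses a plain object with a state dict as a stand-in for ADK's ToolContext (which ADK constructs and injects at runtime); the function name and field are invented for the example:

```python
from types import SimpleNamespace
from typing import Optional


def save_demo_extraction(tool_context, title: Optional[str] = None) -> str:
    # Same pattern as the real tools: write the typed arguments into session state.
    tool_context.state["extraction_result"] = {
        "document_type": "general",
        "extracted_data": {"title": title},
    }
    return "Extraction saved."


ctx = SimpleNamespace(state={})  # stand-in for the injected ToolContext
save_demo_extraction(ctx, title="Q3 Financial Report")
print(ctx.state["extraction_result"])
```

After the agent run, reading `session.state["extraction_result"]` back out is all the service layer has to do.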
Step 5: Build the ADK Agent
Create app/agent.py:
import uuid
from google.adk import Agent
from google.adk.runners import InMemoryRunner
from google.genai import types
from app.tools import save_invoice_extraction, save_contract_extraction, save_general_extraction
INSTRUCTION = """
You are an enterprise document intelligence agent. Your job is to analyze uploaded documents
and extract all relevant structured data from them.
When you receive a document, follow these steps:
1. Identify the document type: invoice, contract, or general (includes images and plain text).
2. Read the document carefully and extract every relevant field.
3. Call exactly ONE of the following tools with the extracted data:
- save_invoice_extraction() for invoices, bills, purchase orders, receipts
- save_contract_extraction() for contracts, agreements, NDAs, MOUs, legal documents
- save_general_extraction() for everything else: reports, images, memos, plain text
Rules:
- Extract ALL fields you can find. If a field is missing from the document, pass null.
- For line_items in invoices, format each item as: "Description | Qty | Unit Price | Total"
- For scanned images or photos of documents, read all visible text before extracting.
- Always call one of the save tools. Never respond without calling a tool.
- Be precise with amounts, dates, and names. Do not infer or guess missing values.
"""
APP_NAME = "document_agent"
def create_runner() -> InMemoryRunner:
    agent = Agent(
        model="gemini-2.5-flash",
        name="document_agent",
        description="Extracts structured data from enterprise documents.",
        instruction=INSTRUCTION,
        tools=[
            save_invoice_extraction,
            save_contract_extraction,
            save_general_extraction,
        ],
    )
    return InMemoryRunner(agent=agent, app_name=APP_NAME)
async def analyze_document(
    runner: InMemoryRunner,
    file_bytes: bytes,
    mime_type: str,
    filename: str,
) -> dict:
    user_id = "api_user"
    session_id = str(uuid.uuid4())
    await runner.session_service.create_session(
        app_name=APP_NAME, user_id=user_id, session_id=session_id
    )
    prompt = (
        f"Analyze this document (filename: {filename}) and extract all structured data. "
        "Call the appropriate save tool with every field you can extract."
    )
    content = types.Content(
        role="user",
        parts=[
            types.Part.from_bytes(data=file_bytes, mime_type=mime_type),
            types.Part.from_text(text=prompt),
        ],
    )
    async for _ in runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=content,
    ):
        pass
    session = await runner.session_service.get_session(
        app_name=APP_NAME, user_id=user_id, session_id=session_id
    )
    result = session.state.get("extraction_result")
    if not result:
        return {
            "document_type": "unknown",
            "extracted_data": {},
            "summary": "The agent could not extract structured data from this document.",
        }
    return result
A few things worth understanding here:
InMemoryRunner handles session management, event routing, and LLM calls. You create it once at startup and reuse it across requests.
types.Part.from_bytes passes the raw file bytes directly to Gemini. The model reads PDFs, images, and text natively without any preprocessing on your side.
run_async returns an async generator of events. We iterate through them but only care about the final session state. The tool call happens inside that iteration.
Step 6: Build the FastAPI Service
Create app/__init__.py (empty) and app/main.py:
import os
from contextlib import asynccontextmanager
from dotenv import load_dotenv
from fastapi import FastAPI, File, UploadFile, HTTPException
from app.agent import create_runner, analyze_document
from app.schemas import AnalysisResponse
load_dotenv()
SUPPORTED_MIME_TYPES = {
    "application/pdf",
    "image/jpeg",
    "image/jpg",
    "image/png",
    "image/webp",
    "text/plain",
    "text/markdown",
}

MAX_FILE_SIZE_MB = 20

@asynccontextmanager
async def lifespan(app: FastAPI):
    if not os.getenv("GOOGLE_API_KEY"):
        raise RuntimeError("GOOGLE_API_KEY environment variable is not set.")
    app.state.runner = create_runner()
    yield

app = FastAPI(
    title="Document Intelligence Agent",
    description="Multimodal enterprise document extraction powered by Gemini and Google ADK.",
    version="1.0.0",
    lifespan=lifespan,
)

@app.get("/health")
async def health():
    return {"status": "ok"}

@app.post("/analyze", response_model=AnalysisResponse)
async def analyze(file: UploadFile = File(...)):
    content_type = file.content_type or ""
    if content_type not in SUPPORTED_MIME_TYPES:
        raise HTTPException(
            status_code=415,
            detail=f"Unsupported file type: {content_type}.",
        )
    file_bytes = await file.read()
    if len(file_bytes) > MAX_FILE_SIZE_MB * 1024 * 1024:
        raise HTTPException(status_code=413, detail="File too large. Max 20MB.")
    if len(file_bytes) == 0:
        raise HTTPException(status_code=400, detail="Uploaded file is empty.")
    result = await analyze_document(
        runner=app.state.runner,
        file_bytes=file_bytes,
        mime_type=content_type,
        filename=file.filename or "document",
    )
    doc_type = result.get("document_type", "unknown")
    extracted = result.get("extracted_data", {})
    summary = _build_summary(doc_type, extracted, file.filename or "document")
    return AnalysisResponse(
        document_type=doc_type,
        filename=file.filename or "document",
        extracted_data=extracted,
        summary=summary,
    )

def _build_summary(doc_type: str, data: dict, filename: str) -> str:
    if doc_type == "invoice":
        vendor = data.get("vendor_name") or "Unknown vendor"
        total = data.get("total_amount") or "unknown amount"
        currency = data.get("currency") or ""
        inv_num = data.get("invoice_number") or "N/A"
        return f"Invoice #{inv_num} from {vendor} for {currency} {total}."
    elif doc_type == "contract":
        parties = data.get("parties") or []
        ctype = data.get("contract_type") or "Agreement"
        party_str = " and ".join(parties) if parties else "unknown parties"
        return f"{ctype} between {party_str}."
    else:
        title = data.get("document_title") or filename
        summary_text = data.get("summary") or "No summary available."
        return f"{title}: {summary_text}"
Test it locally before deploying:
uvicorn app.main:app --host 0.0.0.0 --port 8000
In a second terminal:
curl http://localhost:8000/health
# {"status":"ok"}
curl -X POST http://localhost:8000/analyze \
-F "file=@sample_docs/sample_invoice.txt;type=text/plain"
You should see a full JSON response with every invoice field extracted.
FastAPI also provides an interactive docs UI at http://localhost:8000/docs where you can upload files and inspect responses without writing any curl commands.
Step 7: Containerize with Docker
Create Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ ./app/
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
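Optionally, add a .dockerignore so the virtual environment and your secrets never enter the build context (a minimal example; adjust to your repo):

```
.venv/
.env
__pycache__/
*.pyc
```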
Create docker-compose.yml:
services:
  app:
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env
    restart: unless-stopped
View docker-compose.yml on GitHub →
Build and verify locally:
docker compose up --build
curl http://localhost:8000/health
Step 8: Provision a Vultr Instance
You need to add a billing method to your Vultr account before deploying. Go to your account settings and add a credit or debit card first.
Then deploy an instance:
- Log in to console.vultr.com
- Click Quick Deploy (bottom left) then Instances
- Select Shared CPU
- Location: Amsterdam (good latency from most regions)
- Image: Ubuntu 24.04 LTS
- Plan: vc2-1c-1gb ($5/mo, 1 vCPU / 1GB RAM)
- Skip SSH keys for now (Vultr emails the root password)
- Hostname: document-agent
- Leave all extras unchecked (no backups, no DDoS protection)
- Click Deploy
Wait about 60 seconds for the status to change from Installing to Running. The IP address and root password appear on the instance overview page.
Step 9: Deploy the Agent on Vultr
SSH into the server:
ssh root@YOUR_VULTR_IP
# Accept the fingerprint prompt, then enter the password from the dashboard
Install Docker:
curl -fsSL https://get.docker.com | sh && systemctl enable docker && systemctl start docker
Verify:
docker --version
Back on your local machine, copy the project to the server:
scp -r ./gemini-multimodal-document-agent root@YOUR_VULTR_IP:/opt/document-agent
Back on the server, create the env file and deploy:
cd /opt/document-agent
echo "GOOGLE_API_KEY=your_api_key_here" > .env
docker compose up -d --build
The first build takes 2 to 3 minutes while Docker pulls the base image and installs the Python packages. Once it finishes:
curl http://localhost:8000/health
# {"status":"ok"}
Step 10: Test the Live API
From your local machine, send a real document to the public IP:
curl -X POST http://YOUR_VULTR_IP:8000/analyze \
-F "file=@sample_docs/sample_invoice.txt;type=text/plain"
Expected response:
{
  "document_type": "invoice",
  "filename": "sample_invoice.txt",
  "extracted_data": {
    "vendor_name": "Acme Solutions Ltd.",
    "invoice_number": "INV-2026-0042",
    "invoice_date": "2026-05-05",
    "due_date": "2026-06-04",
    "total_amount": "$6,032.00",
    "currency": "USD",
    "subtotal": "$5,200.00",
    "tax_amount": "$832.00",
    "line_items": [
      "API Integration Services | 1 | $2,500.00 | $2,500.00",
      "Cloud Infrastructure Setup | 1 | $1,200.00 | $1,200.00",
      "Technical Consulting (10 hrs) | 10 | $150.00 | $1,500.00"
    ],
    "payment_terms": "Net 30",
    "billing_address": "TechCorp Inc.\n456 Innovation Drive, Nairobi, Kenya",
    "notes": "Payment Instructions: Bank transfer to Equity Bank, Account No. 1234567890"
  },
  "summary": "Invoice #INV-2026-0042 from Acme Solutions Ltd. for USD $6,032.00.",
  "processing_notes": null
}
Try it with a contract file to see the agent switch tools automatically:
curl -X POST http://YOUR_VULTR_IP:8000/analyze \
-F "file=@sample_docs/sample_contract.txt;type=text/plain"
The agent will call save_contract_extraction instead and return parties, obligations, termination conditions, and governing law.
What's Happening Under the Hood
When a file hits the /analyze endpoint, here is the execution path:
- FastAPI reads the file bytes and validates the MIME type
- analyze_document creates a new ADK session and sends the file to Gemini via InMemoryRunner
- The agent reads the document using Gemini's native multimodal understanding
- Based on what it reads, the agent calls one of the three extraction tools
- The tool writes structured data into the session state
- After the agent finishes, we read that state and return it as JSON
The key design decision is that the tools do not receive the document. Gemini has already read it from the multimodal message context. The tools only receive the extracted fields as typed arguments, which forces the model to commit to specific values rather than returning freeform text.
Frequently Asked Questions
Q: What file types does the /analyze endpoint accept?
PDF, JPEG, PNG, WebP, plain text (.txt), and Markdown (.md) files up to 20 MB. For larger files, swap Part.from_bytes for the Gemini Files API and pass a file URI instead.
Q: Why must list parameters in ADK tools be typed as list[str] instead of just list?
The Gemini API generates a JSON schema from your tool's type annotations. An untyped list produces a schema without an items field, which the API rejects with a 400 INVALID_ARGUMENT error. Using list[str] (or any concrete generic) generates the required items: {type: string} field automatically.
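A simplified sketch of that schema derivation (not ADK's actual implementation, just the shape of the rule):

```python
import typing


def array_schema(annotation) -> dict:
    """Derive a function-calling array schema from a list annotation (simplified)."""
    args = typing.get_args(annotation)
    if typing.get_origin(annotation) is list and args:
        item_type = {str: "string", int: "integer", float: "number"}.get(args[0], "object")
        return {"type": "array", "items": {"type": item_type}}
    # A bare `list` carries no element type, so no "items" field can be emitted,
    # and the Gemini API rejects the resulting schema with 400 INVALID_ARGUMENT.
    return {"type": "array"}


print(array_schema(list[str]))  # {'type': 'array', 'items': {'type': 'string'}}
print(array_schema(list))       # {'type': 'array'}  <- rejected by the API
```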
Q: Can I use this stack in an AI hackathon 2026 project?
Yes — this is the point. The Docker Compose setup deploys to any cloud instance in one command, and the ADK tool-calling pattern makes it easy to extend with new document types or swap Gemini for another model. Fork the GitHub repo and build on top of it. Check upcoming AI hackathons on LabLab.ai to find a competition to enter.
Q: How do I add support for a new document type, like purchase orders?
Create a new extraction function in app/tools.py (e.g. save_purchase_order_extraction) following the same pattern — typed parameters, ToolContext as first argument, write to tool_context.state. Then add it to the tools list in app/agent.py and update the INSTRUCTION string to tell the agent when to call it.
Next Steps
- Add a firewall — On Vultr, create a Firewall Group under Network to restrict port 8000 to trusted IPs, or put Nginx in front as a reverse proxy.
- Handle larger files — For files over 20MB, swap Part.from_bytes for the Gemini Files API (client.files.upload) and pass a file URI. Gemini supports PDFs up to 1,000 pages.
- Add more document types — Define a new tool (save_receipt_extraction, save_purchase_order_extraction), add it to the agent, and update the instruction to describe when to call it.
- Persist results — Swap InMemorySessionService for a database-backed session service and store extraction results in Postgres or Supabase.
Conclusion
You now have a working multimodal document agent running on a live Vultr server. The agent uses Google ADK's tool-calling pattern to force structured outputs from Gemini without prompt engineering tricks, and the FastAPI wrapper makes it consumable by any frontend or backend system.
The full source code for this project is available in the GitHub repository. If you want to apply this to a real hackathon project, explore the upcoming AI hackathons on LabLab.ai and bring this stack with you.
