Gemini Multimodal Document Agent Tutorial: Build and Deploy on Vultr for AI Hackathons

Friday, May 08, 2026 by kimoisteve

Build a Multimodal Document Intelligence Agent with Gemini ADK and Deploy It on Vultr

Enterprise teams spend thousands of hours manually extracting data from invoices, contracts, and scanned documents. Gemini 2.5 Flash can read any of those documents natively, and Google's Agent Development Kit (ADK) gives you a clean framework to turn that capability into a production service. This tutorial walks you through building a document extraction agent, wrapping it in a FastAPI service, containerizing it, and shipping it to a live Vultr server.

By the end you will have a public API endpoint that accepts a PDF, image, or text file and returns structured JSON with every relevant field pulled out of the document.

This kind of document intelligence agent is particularly valuable for AI hackathon 2026 teams, where shipping a working prototype in 48 hours is the whole game. Whether you are building a fintech tool, a legal document processor, or a contract analyzer, this stack gives you a production-ready foundation within a single sprint. Browse upcoming AI hackathons on LabLab.ai to see where you can apply it next.

What You'll Build

A containerized FastAPI service backed by a Google ADK agent that:

  • Accepts file uploads (PDF, image, plain text)
  • Identifies the document type automatically
  • Calls the appropriate extraction tool (invoice, contract, or general)
  • Returns clean structured JSON

Stack: Python 3.11, Google ADK 1.18, Gemini 2.5 Flash, FastAPI, Docker, Vultr Cloud Compute.

Prerequisites

  • A Google AI Studio account and API key
  • A Vultr account with a billing method added
  • Docker installed locally
  • Python 3.10 or higher
  • Basic familiarity with FastAPI and async Python

Step 1: Get Your Gemini API Key

Go to Google AI Studio, sign in, and click Get API key. Create a new project if prompted, then copy the key. Keep it in a safe place as you will need it for both local development and the Vultr deployment.

Step 2: Set Up the Project

Create the project directory and install dependencies:

mkdir gemini-multimodal-document-agent
cd gemini-multimodal-document-agent
python3 -m venv .venv
source .venv/bin/activate

Create requirements.txt:

google-adk==1.18.0
fastapi>=0.111.0
uvicorn[standard]>=0.29.0
python-multipart>=0.0.9
pydantic>=2.7.0
python-dotenv>=1.0.0

Install:

pip install -r requirements.txt

Create a .env file:

GOOGLE_API_KEY=your_api_key_here

Add .env and .venv/ to .gitignore so they never get committed.

Your final project structure will be:

gemini-multimodal-document-agent/
├── app/
│   ├── __init__.py
│   ├── agent.py
│   ├── tools.py
│   ├── schemas.py
│   └── main.py
├── sample_docs/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── .env

Step 3: Define the Response Schemas

Create app/schemas.py:

from pydantic import BaseModel
from typing import Optional, Any


class AnalysisResponse(BaseModel):
    document_type: str
    filename: str
    extracted_data: dict[str, Any]
    summary: str
    processing_notes: Optional[str] = None

View app/schemas.py on GitHub →

Step 4: Build the Extraction Tools

The ADK agent uses function-calling tools to return structured data. Each tool corresponds to a document type. When the agent reads a document, it decides which tool to call and passes the extracted fields as arguments. The tool writes those arguments into the session state, which we read back after the agent finishes.

Create app/tools.py:

from typing import Optional
from google.adk.tools.tool_context import ToolContext


def save_invoice_extraction(
    tool_context: ToolContext,
    vendor_name: Optional[str] = None,
    invoice_number: Optional[str] = None,
    invoice_date: Optional[str] = None,
    due_date: Optional[str] = None,
    total_amount: Optional[str] = None,
    currency: Optional[str] = None,
    subtotal: Optional[str] = None,
    tax_amount: Optional[str] = None,
    line_items: Optional[list[str]] = None,
    payment_terms: Optional[str] = None,
    billing_address: Optional[str] = None,
    notes: Optional[str] = None,
) -> str:
    """Save structured data extracted from an invoice document.

    Call this tool when the document is an invoice or bill.

    Args:
        vendor_name: Name of the vendor or supplier issuing the invoice.
        invoice_number: Unique invoice identifier or number.
        invoice_date: Date the invoice was issued (ISO format preferred).
        due_date: Payment due date (ISO format preferred).
        total_amount: Total amount due including taxes.
        currency: Currency code (e.g. USD, EUR, KES).
        subtotal: Subtotal before taxes.
        tax_amount: Tax amount applied.
        line_items: Each item as a string: "Description | Qty | Unit Price | Total".
        payment_terms: Payment terms such as Net 30, Due on receipt, etc.
        billing_address: Billing address of the recipient.
        notes: Any additional notes or payment instructions on the invoice.

    Returns:
        Confirmation string.
    """
    tool_context.state["extraction_result"] = {
        "document_type": "invoice",
        "extracted_data": {
            "vendor_name": vendor_name,
            "invoice_number": invoice_number,
            "invoice_date": invoice_date,
            "due_date": due_date,
            "total_amount": total_amount,
            "currency": currency,
            "subtotal": subtotal,
            "tax_amount": tax_amount,
            "line_items": line_items or [],
            "payment_terms": payment_terms,
            "billing_address": billing_address,
            "notes": notes,
        },
    }
    return "Invoice extraction saved."


def save_contract_extraction(
    tool_context: ToolContext,
    parties: Optional[list[str]] = None,
    effective_date: Optional[str] = None,
    expiration_date: Optional[str] = None,
    contract_type: Optional[str] = None,
    key_obligations: Optional[list[str]] = None,
    termination_conditions: Optional[list[str]] = None,
    governing_law: Optional[str] = None,
    jurisdiction: Optional[str] = None,
) -> str:
    """Save structured data extracted from a contract or legal agreement.

    Call this tool when the document is a contract, agreement, MOU, NDA, or similar legal document.

    Args:
        parties: List of party names involved in the contract.
        effective_date: Date the contract takes effect (ISO format preferred).
        expiration_date: Date the contract expires (ISO format preferred).
        contract_type: Type of contract (e.g. NDA, Service Agreement, Employment Contract).
        key_obligations: List of key obligations or responsibilities for each party.
        termination_conditions: List of conditions under which the contract can be terminated.
        governing_law: The law or jurisdiction governing the contract.
        jurisdiction: The jurisdiction for resolving disputes.

    Returns:
        Confirmation string.
    """
    tool_context.state["extraction_result"] = {
        "document_type": "contract",
        "extracted_data": {
            "parties": parties or [],
            "effective_date": effective_date,
            "expiration_date": expiration_date,
            "contract_type": contract_type,
            "key_obligations": key_obligations or [],
            "termination_conditions": termination_conditions or [],
            "governing_law": governing_law,
            "jurisdiction": jurisdiction,
        },
    }
    return "Contract extraction saved."


def save_general_extraction(
    tool_context: ToolContext,
    document_title: Optional[str] = None,
    summary: Optional[str] = None,
    key_entities: Optional[list[str]] = None,
    dates_mentioned: Optional[list[str]] = None,
    key_figures: Optional[list[str]] = None,
    main_topics: Optional[list[str]] = None,
) -> str:
    """Save structured data extracted from a general document, report, image, or plain text.

    Call this tool for documents that are not invoices or contracts.

    Args:
        document_title: Title or inferred title of the document.
        summary: 2-3 sentence summary of the document content.
        key_entities: List of important people, organizations, or products mentioned.
        dates_mentioned: List of any dates referenced in the document.
        key_figures: List of important numbers, amounts, or statistics found.
        main_topics: List of main topics or themes covered.

    Returns:
        Confirmation string.
    """
    tool_context.state["extraction_result"] = {
        "document_type": "general",
        "extracted_data": {
            "document_title": document_title,
            "summary": summary,
            "key_entities": key_entities or [],
            "dates_mentioned": dates_mentioned or [],
            "key_figures": key_figures or [],
            "main_topics": main_topics or [],
        },
    }
    return "General extraction saved."

View app/tools.py on GitHub →

Three things to note about this design:

  1. All list parameters are typed as list[str]. The Gemini API requires typed list parameters in tool schemas and will reject untyped list annotations.
  2. ToolContext is injected automatically by ADK. You do not instantiate it yourself.
  3. The tool writes to tool_context.state, a session-scoped dictionary we read after the agent finishes running.
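The `list[str]` requirement in point 1 can be made concrete with a toy schema generator. This is NOT ADK's actual implementation, just an illustration of why a parameterized list works and a bare `list` cannot: only `list[str]` carries the item type that the schema's `items` field needs.

```python
import typing
from typing import Optional, get_args, get_origin

# Toy schema generator -- NOT ADK's real code -- illustrating why
# list[str] yields a valid tool schema while bare `list` cannot.
def annotation_to_schema(ann):
    if get_origin(ann) is typing.Union:  # unwrap Optional[X] -> X
        args = [a for a in get_args(ann) if a is not type(None)]
        ann = args[0]
    if ann is str:
        return {"type": "string"}
    if ann is list:  # bare list: there is no item type to put in "items"
        raise TypeError("bare `list` has no item type; use list[str]")
    if get_origin(ann) is list:
        (item,) = get_args(ann)  # the parameter, e.g. str in list[str]
        return {"type": "array", "items": annotation_to_schema(item)}
    raise TypeError(f"unsupported annotation: {ann!r}")

print(annotation_to_schema(Optional[list[str]]))
# {'type': 'array', 'items': {'type': 'string'}}
```

Run it with `Optional[list[str]]` and you get a complete array schema; pass a bare `list` and there is simply nothing to generate for `items`, which is the same gap that makes the Gemini API reject the tool.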

Step 5: Build the ADK Agent

Create app/agent.py:

import asyncio
import uuid

from google.adk import Agent
from google.adk.runners import InMemoryRunner
from google.genai import types

from app.tools import save_invoice_extraction, save_contract_extraction, save_general_extraction

INSTRUCTION = """
You are an enterprise document intelligence agent. Your job is to analyze uploaded documents
and extract all relevant structured data from them.

When you receive a document, follow these steps:
1. Identify the document type: invoice, contract, or general (includes images and plain text).
2. Read the document carefully and extract every relevant field.
3. Call exactly ONE of the following tools with the extracted data:
   - save_invoice_extraction() for invoices, bills, purchase orders, receipts
   - save_contract_extraction() for contracts, agreements, NDAs, MOUs, legal documents
   - save_general_extraction() for everything else: reports, images, memos, plain text

Rules:
- Extract ALL fields you can find. If a field is missing from the document, pass null.
- For line_items in invoices, format each item as: "Description | Qty | Unit Price | Total"
- For scanned images or photos of documents, read all visible text before extracting.
- Always call one of the save tools. Never respond without calling a tool.
- Be precise with amounts, dates, and names. Do not infer or guess missing values.
"""

APP_NAME = "document_agent"


def create_runner() -> InMemoryRunner:
    agent = Agent(
        model="gemini-2.5-flash",
        name="document_agent",
        description="Extracts structured data from enterprise documents.",
        instruction=INSTRUCTION,
        tools=[
            save_invoice_extraction,
            save_contract_extraction,
            save_general_extraction,
        ],
    )
    return InMemoryRunner(agent=agent, app_name=APP_NAME)


async def analyze_document(
    runner: InMemoryRunner,
    file_bytes: bytes,
    mime_type: str,
    filename: str,
) -> dict:
    user_id = "api_user"
    session_id = str(uuid.uuid4())

    await runner.session_service.create_session(
        app_name=APP_NAME, user_id=user_id, session_id=session_id
    )

    prompt = (
        f"Analyze this document (filename: {filename}) and extract all structured data. "
        "Call the appropriate save tool with every field you can extract."
    )

    content = types.Content(
        role="user",
        parts=[
            types.Part.from_bytes(data=file_bytes, mime_type=mime_type),
            types.Part.from_text(text=prompt),
        ],
    )

    async for _ in runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=content,
    ):
        pass

    session = await runner.session_service.get_session(
        app_name=APP_NAME, user_id=user_id, session_id=session_id
    )

    result = session.state.get("extraction_result")
    if not result:
        return {
            "document_type": "unknown",
            "extracted_data": {},
            "summary": "The agent could not extract structured data from this document.",
        }

    return result

A few things worth understanding here:

InMemoryRunner handles session management, event routing, and LLM calls. You create it once at startup and reuse it across requests.

types.Part.from_bytes passes the raw file bytes directly to Gemini. The model reads PDFs, images, and text natively without any preprocessing on your side.

run_async returns an async generator of events. We iterate through them but only care about the final session state. The tool call happens inside that iteration.
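That drain-the-events-then-read-state pattern is easy to see in isolation. Here is a minimal sketch with a stub runner standing in for ADK's `InMemoryRunner` (`StubRunner` is ours, not part of ADK):

```python
import asyncio

class StubRunner:
    """Stand-in for InMemoryRunner: yields events, writes state as a side effect."""
    def __init__(self):
        self.state = {}

    async def run_async(self):
        yield "model_turn"                       # model reads the document
        self.state["extraction_result"] = {      # a tool call writes state
            "document_type": "invoice",
        }
        yield "tool_call"

async def main():
    runner = StubRunner()
    async for _ in runner.run_async():
        pass  # ignore the events; we only want the state side effect
    return runner.state.get("extraction_result")

result = asyncio.run(main())
print(result)  # {'document_type': 'invoice'}
```

The events carry useful debugging information (tool calls, partial responses), but for this service the final session state is the only thing the API needs to return.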

View app/agent.py on GitHub →

Step 6: Build the FastAPI Service

Create app/__init__.py (empty) and app/main.py:

import os
from contextlib import asynccontextmanager

from dotenv import load_dotenv
from fastapi import FastAPI, File, UploadFile, HTTPException

from app.agent import create_runner, analyze_document
from app.schemas import AnalysisResponse

load_dotenv()

SUPPORTED_MIME_TYPES = {
    "application/pdf",
    "image/jpeg",
    "image/jpg",
    "image/png",
    "image/webp",
    "text/plain",
    "text/markdown",
}

MAX_FILE_SIZE_MB = 20


@asynccontextmanager
async def lifespan(app: FastAPI):
    if not os.getenv("GOOGLE_API_KEY"):
        raise RuntimeError("GOOGLE_API_KEY environment variable is not set.")
    app.state.runner = create_runner()
    yield


app = FastAPI(
    title="Document Intelligence Agent",
    description="Multimodal enterprise document extraction powered by Gemini and Google ADK.",
    version="1.0.0",
    lifespan=lifespan,
)


@app.get("/health")
async def health():
    return {"status": "ok"}


@app.post("/analyze", response_model=AnalysisResponse)
async def analyze(file: UploadFile = File(...)):
    content_type = file.content_type or ""

    if content_type not in SUPPORTED_MIME_TYPES:
        raise HTTPException(
            status_code=415,
            detail=f"Unsupported file type: {content_type}.",
        )

    file_bytes = await file.read()

    if len(file_bytes) > MAX_FILE_SIZE_MB * 1024 * 1024:
        raise HTTPException(status_code=413, detail="File too large. Max 20MB.")

    if len(file_bytes) == 0:
        raise HTTPException(status_code=400, detail="Uploaded file is empty.")

    result = await analyze_document(
        runner=app.state.runner,
        file_bytes=file_bytes,
        mime_type=content_type,
        filename=file.filename or "document",
    )

    doc_type = result.get("document_type", "unknown")
    extracted = result.get("extracted_data", {})
    summary = _build_summary(doc_type, extracted, file.filename or "document")

    return AnalysisResponse(
        document_type=doc_type,
        filename=file.filename or "document",
        extracted_data=extracted,
        summary=summary,
    )


def _build_summary(doc_type: str, data: dict, filename: str) -> str:
    if doc_type == "invoice":
        vendor = data.get("vendor_name") or "Unknown vendor"
        total = data.get("total_amount") or "unknown amount"
        currency = data.get("currency") or ""
        inv_num = data.get("invoice_number") or "N/A"
        return f"Invoice #{inv_num} from {vendor} for {currency} {total}."
    elif doc_type == "contract":
        parties = data.get("parties") or []
        ctype = data.get("contract_type") or "Agreement"
        party_str = " and ".join(parties) if parties else "unknown parties"
        return f"{ctype} between {party_str}."
    else:
        title = data.get("document_title") or filename
        summary_text = data.get("summary") or "No summary available."
        return f"{title}: {summary_text}"

Test it locally before deploying:

uvicorn app.main:app --host 0.0.0.0 --port 8000

In a second terminal:

curl http://localhost:8000/health
# {"status":"ok"}

curl -X POST http://localhost:8000/analyze \
  -F "file=@sample_docs/sample_invoice.txt;type=text/plain"

You should see a full JSON response with every invoice field extracted.

FastAPI also provides an interactive docs UI at http://localhost:8000/docs where you can upload files and inspect responses without writing any curl commands.
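If you prefer Python over curl, you can hit the endpoint with nothing but the standard library. `build_multipart` and `analyze` below are our helper sketch, not part of the project code:

```python
import urllib.request
import uuid

def build_multipart(field: str, filename: str, data: bytes, mime: str):
    """Build a multipart/form-data body by hand (no third-party deps)."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def analyze(url: str, path: str, mime: str = "text/plain") -> bytes:
    """POST a local file to the /analyze endpoint and return the raw response."""
    with open(path, "rb") as f:
        data = f.read()
    body, content_type = build_multipart("file", path.rsplit("/", 1)[-1], data, mime)
    req = urllib.request.Request(url, data=body, headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage (assumes the server from the previous step is running):
# analyze("http://localhost:8000/analyze", "sample_docs/sample_invoice.txt")
```

In a real client you would likely reach for `requests` or `httpx` instead, but this version has zero dependencies and is handy inside minimal containers.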

View app/main.py on GitHub →

Step 7: Containerize with Docker

Create Dockerfile:

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app/ ./app/

EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

View Dockerfile on GitHub →

Create docker-compose.yml:

services:
  app:
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env
    restart: unless-stopped

View docker-compose.yml on GitHub →

Build and verify locally:

docker compose up --build
curl http://localhost:8000/health

Step 8: Provision a Vultr Instance

You need to add a billing method to your Vultr account before deploying. Go to your account settings and add a credit or debit card first.

Then deploy an instance:

  1. Log in to console.vultr.com
  2. Click Quick Deploy (bottom left) then Instances
  3. Select Shared CPU
  4. Location: Amsterdam (good latency from most regions)
  5. Image: Ubuntu 24.04 LTS
  6. Plan: vc2-1c-1gb ($5/mo, 1 vCPU / 1GB RAM)
  7. Skip SSH keys for now (Vultr emails the root password)
  8. Hostname: document-agent
  9. Leave all extras unchecked (no backups, no DDoS protection)
  10. Click Deploy

Wait about 60 seconds for the status to change from Installing to Running. The IP address and root password appear on the instance overview page.

Step 9: Deploy the Agent on Vultr

SSH into the server:

ssh root@YOUR_VULTR_IP
# Accept the fingerprint prompt, then enter the password from the dashboard

Install Docker:

curl -fsSL https://get.docker.com | sh && systemctl enable docker && systemctl start docker

Verify:

docker --version

Back on your local machine, copy the project to the server:

scp -r ./gemini-multimodal-document-agent root@YOUR_VULTR_IP:/opt/document-agent

Back on the server, create the env file and deploy:

cd /opt/document-agent
echo "GOOGLE_API_KEY=your_api_key_here" > .env
docker compose up -d --build

The first build takes 2 to 3 minutes while Docker pulls the base image and installs the Python packages. Once it finishes:

curl http://localhost:8000/health
# {"status":"ok"}

Step 10: Test the Live API

From your local machine, send a real document to the public IP:

curl -X POST http://YOUR_VULTR_IP:8000/analyze \
  -F "file=@sample_docs/sample_invoice.txt;type=text/plain"

Expected response:

{
  "document_type": "invoice",
  "filename": "sample_invoice.txt",
  "extracted_data": {
    "vendor_name": "Acme Solutions Ltd.",
    "invoice_number": "INV-2026-0042",
    "invoice_date": "2026-05-05",
    "due_date": "2026-06-04",
    "total_amount": "$6,032.00",
    "currency": "USD",
    "subtotal": "$5,200.00",
    "tax_amount": "$832.00",
    "line_items": [
      "API Integration Services | 1 | $2,500.00 | $2,500.00",
      "Cloud Infrastructure Setup | 1 | $1,200.00 | $1,200.00",
      "Technical Consulting (10 hrs) | 10 | $150.00 | $1,500.00"
    ],
    "payment_terms": "Net 30",
    "billing_address": "TechCorp Inc.\n456 Innovation Drive, Nairobi, Kenya",
    "notes": "Payment Instructions: Bank transfer to Equity Bank, Account No. 1234567890"
  },
  "summary": "Invoice #INV-2026-0042 from Acme Solutions Ltd. for USD $6,032.00.",
  "processing_notes": null
}

Try it with a contract file to see the agent switch tools automatically:

curl -X POST http://YOUR_VULTR_IP:8000/analyze \
  -F "file=@sample_docs/sample_contract.txt;type=text/plain"

The agent will call save_contract_extraction instead and return parties, obligations, termination conditions, and governing law.

What's Happening Under the Hood

When a file hits the /analyze endpoint, here is the execution path:

  1. FastAPI reads the file bytes and validates the MIME type
  2. analyze_document creates a new ADK session and sends the file to Gemini via InMemoryRunner
  3. The agent reads the document using Gemini's native multimodal understanding
  4. Based on what it reads, the agent calls one of the three extraction tools
  5. The tool writes structured data into the session state
  6. After the agent finishes, we read that state and return it as JSON

The key design decision is that the tools do not receive the document. Gemini has already read it from the multimodal message context. The tools only receive the extracted fields as typed arguments, which forces the model to commit to specific values rather than returning freeform text.

Frequently Asked Questions

Q: What file types does the /analyze endpoint accept? PDF, JPEG, PNG, WebP, plain text (.txt), and Markdown (.md) files up to 20 MB. For larger files, swap Part.from_bytes for the Gemini Files API and pass a file URI instead.

Q: Why must list parameters in ADK tools be typed as list[str] instead of just list? The Gemini API generates a JSON schema from your tool's type annotations. An untyped list produces a schema without an items field, which the API rejects with a 400 INVALID_ARGUMENT error. Using list[str] (or any concrete generic) generates the required items: {type: string} field automatically.

Q: Can I use this stack in an AI hackathon 2026 project? Yes — this is the point. The Docker Compose setup deploys to any cloud instance in one command, and the ADK tool-calling pattern makes it easy to extend with new document types or swap Gemini for another model. Fork the GitHub repo and build on top of it. Check upcoming AI hackathons on LabLab.ai to find a competition to enter.

Q: How do I add support for a new document type, like purchase orders? Create a new extraction function in app/tools.py (e.g. save_purchase_order_extraction) following the same pattern — typed parameters, ToolContext as first argument, write to tool_context.state. Then add it to the tools list in app/agent.py and update the INSTRUCTION string to tell the agent when to call it.
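Following that pattern, here is a sketch of the new tool. The field names are illustrative, and in the real `app/tools.py` the first parameter would be typed as ADK's `ToolContext`; it is left untyped here so the sketch stands alone:

```python
from typing import Optional

# Sketch of a purchase-order tool following the app/tools.py pattern.
# Field names are illustrative; in the real file the first parameter
# is typed as google.adk's ToolContext.
def save_purchase_order_extraction(
    tool_context,
    po_number: Optional[str] = None,
    supplier_name: Optional[str] = None,
    order_date: Optional[str] = None,
    total_amount: Optional[str] = None,
    line_items: Optional[list[str]] = None,
) -> str:
    """Save structured data extracted from a purchase order.

    Call this tool when the document is a purchase order.

    Args:
        po_number: Unique purchase order number.
        supplier_name: Name of the supplier fulfilling the order.
        order_date: Date the order was placed (ISO format preferred).
        total_amount: Total order value including taxes.
        line_items: Each item as "Description | Qty | Unit Price | Total".

    Returns:
        Confirmation string.
    """
    tool_context.state["extraction_result"] = {
        "document_type": "purchase_order",
        "extracted_data": {
            "po_number": po_number,
            "supplier_name": supplier_name,
            "order_date": order_date,
            "total_amount": total_amount,
            "line_items": line_items or [],
        },
    }
    return "Purchase order extraction saved."
```

Remember to also remove "purchase orders" from the `save_invoice_extraction` line of the `INSTRUCTION` string, since the current instruction routes them to the invoice tool.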

Next Steps

  • Add a firewall — On Vultr, create a Firewall Group under Network to restrict port 8000 to trusted IPs, or put Nginx in front as a reverse proxy.
  • Handle larger files — For files over 20MB, swap Part.from_bytes for the Gemini Files API (client.files.upload) and pass a file URI. Gemini supports PDFs up to 1,000 pages.
  • Add more document types — Define a new tool (save_receipt_extraction, save_purchase_order_extraction), add it to the agent, and update the instruction to describe when to call it.
  • Persist results — Swap InMemorySessionService for a database-backed session service and store extraction results in Postgres or Supabase.

Conclusion

You now have a working multimodal document agent running on a live Vultr server. The agent uses Google ADK's tool-calling pattern to force structured outputs from Gemini without prompt engineering tricks, and the FastAPI wrapper makes it consumable by any frontend or backend system.

The full source code for this project is available in the GitHub repository. If you want to apply this to a real hackathon project, explore the upcoming AI hackathons on LabLab.ai and bring this stack with you.
