AMD Developer Cloud Tutorial: Host Your First LLM on AMD GPU for AI Hackathons

Introduction
If you've been building AI applications entirely on managed API endpoints, this tutorial is your entry point into running models on raw GPU hardware: your own endpoint, your own model, your own infrastructure.
AMD Developer Cloud gives you on-demand access to the AMD Instinct MI300X: a GPU with 192 GB of VRAM, the same hardware Meta uses to serve all of its live Llama 3.1 405B traffic in production. You don't need physical hardware, a data center, or the NVIDIA tax. For $1.99/hour (or free with credits), you get a machine that can load almost any open-source model available today.
In this tutorial you'll go from zero to a live, publicly accessible AI API endpoint in under 30 minutes:
- Sign up for the AMD AI Developer Program and claim your free credits
- Create a GPU Droplet pre-loaded with vLLM
- SSH in, launch a model, and hit the API from your laptop
This setup is particularly practical for AI hackathons, where you need a real inference endpoint running fast without spending days on infrastructure. Having your own AMD-hosted model gives you full control over latency, model selection, and cost, which is a real advantage under hackathon time constraints. If you're looking for upcoming AI hackathons to put this into practice, lablab.ai runs global events year-round.
Participate in the AMD Developer Hackathon on lablab.ai and put this stack to work.
Why AMD Developer Cloud?
Before diving in, here's why this setup makes sense for developers building AI applications:
| Feature | AMD MI300X | NVIDIA H100 |
|---|---|---|
| VRAM | 192 GB | 80 GB |
| Memory bandwidth | 5.3 TB/s | 3.35 TB/s |
| Price (cloud) | $1.99/hr | ~$4-6/hr |
| Open-source stack | ROCm (fully open) | CUDA (proprietary) |
The MI300X is the only GPU that can serve Llama 3.1 405B in float16 on a single 8-GPU node, with no spillover across machines. For inference serving, vLLM on MI300X achieves up to 1.5x higher throughput and 1.7x faster time-to-first-token according to AMD's published benchmarks. For a hackathon or a prototype, it's the highest-value GPU you can access today.
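The VRAM numbers above translate directly into which models fit where. As a back-of-the-envelope sketch (a rough rule of thumb, not an official sizing tool): float16 weights take about 2 bytes per parameter, ignoring KV cache and activation overhead.

```python
# Hypothetical helper: rough float16 weight footprint for a model.
# Rule of thumb: ~2 bytes per parameter, ignoring KV cache and activations.
def fp16_weight_gb(params_billion: float) -> float:
    return params_billion * 1e9 * 2 / 1e9  # decimal GB

for name, b in [("Qwen2.5-1.5B", 1.5), ("Qwen2.5-7B", 7.0), ("Llama 3.1 405B", 405.0)]:
    print(f"{name}: ~{fp16_weight_gb(b):.0f} GB of weights")
```

At roughly 810 GB of weights, Llama 3.1 405B overflows a single 192 GB GPU but fits inside one 8x MI300X node (1,536 GB total), whereas an 8x 80 GB node does not have the headroom in float16.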
Prerequisites
- A computer with a terminal (Mac, Linux, or Windows with WSL)
- An SSH key pair (you'll generate one if you don't have one)
- A credit card on file (required to unlock GPU access; charges are covered by free credits)
Phase 1: Getting Access and Credits
Joining the AMD AI Developer Program
The fastest way to get GPU credits is through the AMD AI Developer Program. It gives you $100 in credits (~50 hours on a single MI300X) just for signing up; there's no approval process.
- Go to AMD AI Developer Program
- Create a free account
- Credits are applied automatically to your AMD Developer Cloud account
As a bonus, the program also includes a private Discord with AMD engineers and a one-month DeepLearning.AI pro membership.
Accessing AMD Developer Cloud
AMD Developer Cloud is powered by DigitalOcean. Once you've joined the AMD AI Developer Program, sign up at AMD Developer Cloud.
Sign in with your AMD Developer account. You'll land on a standard DigitalOcean-style dashboard, but this instance gives you access to AMD Instinct GPU hardware that isn't available on the regular DigitalOcean.
Adding a Payment Method
Before you can create a GPU Droplet, the platform requires a valid payment method on file, even if your $100 credits will cover everything. Without a card, the "Create GPU Droplet" button stays grayed out.
Go to Billing in the left sidebar and add a card. Your credits will be consumed first; the card is a safety net for overage.
Cost reference: At $1.99/hr for a single MI300X, $100 gives you approximately 50 hours of GPU time. Credits typically expire 30 days after being applied; check your expiry date under Billing.
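The cost math is simple enough to sanity-check yourself:

```python
# Quick arithmetic behind the "$100 ~ 50 hours" figure.
CREDITS_USD = 100.00
RATE_PER_HOUR = 1.99  # single MI300X droplet

hours = CREDITS_USD / RATE_PER_HOUR
print(f"{hours:.1f} GPU hours")  # prints "50.3 GPU hours"
```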
Phase 2: Creating the GPU Droplet
Choosing Your Configuration
Navigate to GPU Droplets in the sidebar and click Create GPU Droplet.
You'll be presented with several choices. Here's what to select and why:
Region
Choose ATL1 (Atlanta) as it consistently has MI300X availability. If you see capacity issues, try the other available region.
GPU Plan: MI300X (1 GPU)
| Spec | Value |
|---|---|
| GPU | 1x AMD Instinct MI300X |
| VRAM | 192 GB HBM3 |
| vCPU | 20 |
| RAM | 240 GB |
| Boot disk | 720 GB NVMe |
| Scratch disk | 5 TB |
| Cost | $1.99/hr |
The single MI300X is the right choice for running and serving a model. The 8x MI300X option ($15.92/hr) is for large distributed training jobs or serving 70B+ parameter models at scale; it's overkill for a first deployment.
Image: vLLM Quick Start Package
When selecting the image, you'll see several options:
| Image | What it is |
|---|---|
| vLLM (recommended) | Pre-configured LLM inference engine with OpenAI-compatible API |
| ROCm Software | Bare AMD GPU environment, requires manual setup |
| SGLang | Alternative inference framework |
| PyTorch | Raw PyTorch, for training not serving |
| Megatron | Large-scale distributed training only |
Select the vLLM Quick Start image. This gives you a pre-built Docker container with vLLM, ROCm, and all dependencies already installed and configured. Zero setup time.
SSH Key
GPU Droplets require SSH key authentication; password login is disabled. If you already have a key pair, add your public key here. If not, generate one:
ssh-keygen -t ed25519 -C "[email protected]"
Your public key is at ~/.ssh/id_ed25519.pub. Copy its contents and paste it into the SSH key field.
Creating the Droplet
Click Create GPU Droplet. The droplet takes 2-4 minutes to provision. You'll see it move from Creating to Active in the dashboard. Once active, copy the Public IP address as you'll need it for the next step.
Phase 3: Connecting and Launching a Model
SSH Into the Droplet
Once the droplet is active, connect to it from your terminal:
ssh root@<your-droplet-ip>
Replace <your-droplet-ip> with the public IP from the dashboard. If your SSH key is in a non-default location, specify it with -i:
ssh -i ~/.ssh/id_ed25519 root@<your-droplet-ip>
You'll land in a root shell on Ubuntu 24.04.
Entering the vLLM Container
The vLLM Quick Start image runs a pre-built Docker container named rocm. Enter it:
docker exec -it rocm /bin/bash
You're now inside the container where vLLM is installed and the AMD ROCm stack is ready.
Launching Your First Model
Run the following command to start the vLLM API server with Qwen2.5-1.5B-Instruct:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-1.5B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
> /tmp/vllm.log 2>&1 &
Why Qwen2.5-1.5B?
- Downloads freely from HuggingFace with no authentication (unlike Llama, which requires accepting Meta's license)
- At 1.5B parameters (~3 GB), it loads in seconds on the MI300X's 192 GB of VRAM
- Supports the OpenAI chat format (/v1/chat/completions) out of the box
- Produces quality responses for demos and prototypes
Why --dtype float16?
The model's default dtype is bfloat16, but float16 has broader compatibility with AMD ROCm at this stage. vLLM will warn you about the cast; float16 is the correct choice here.
Why run it in the background with &?
Running in the foreground locks your terminal. The background operator (&) plus the log redirect (> /tmp/vllm.log 2>&1) lets you monitor startup progress while keeping your shell free.
Monitoring Startup
Watch the logs until the server is ready:
tail -f /tmp/vllm.log
You'll see the model weights download from HuggingFace, load into GPU memory, and ROCm graph compilation complete. The whole process takes about 25-30 seconds on the first run. Wait for this line:
INFO: Application startup complete.
Then press Ctrl+C to stop tailing the log. Your server is live.
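If you'd rather script the wait than watch the log, you can poll the server's /v1/models endpoint until it answers. This is a minimal sketch, not part of the vLLM tooling; the probe parameter exists only so the loop can be exercised without a live server.

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(url: str = "http://localhost:8000/v1/models",
                     timeout_s: float = 120.0,
                     probe=None) -> bool:
    """Poll the vLLM /v1/models endpoint until it answers 200 or we time out.

    `probe` is injectable for testing; by default it performs a real HTTP GET.
    """
    def default_probe(u: str) -> bool:
        try:
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(2)
    return False
```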
Phase 4: Testing the API
Test From Inside the Droplet
With the server running, run a quick test from inside the container:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [{"role": "user", "content": "Hello"}]
}'
You'll get a response like:
{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"choices": [{
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 30,
"completion_tokens": 10,
"total_tokens": 40
}
}
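In application code you'll usually want just the reply text and token counts. A minimal parsing sketch, run against the sample response shown above:

```python
# Minimal sketch: pull the assistant's reply and token usage out of a
# /v1/chat/completions response (same shape as the sample above).
def parse_chat_response(payload: dict) -> tuple[str, int]:
    content = payload["choices"][0]["message"]["content"]
    total_tokens = payload["usage"]["total_tokens"]
    return content, total_tokens

sample = {
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "choices": [{
        "message": {"role": "assistant",
                    "content": "Hello! How can I assist you today?"},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 30, "completion_tokens": 10, "total_tokens": 40},
}

text, tokens = parse_chat_response(sample)
print(text, tokens)
```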
Test From Your Laptop
Open a new terminal on your local machine and call the public IP directly:
curl -s http://<your-droplet-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [{"role": "user", "content": "Explain ROCm in one sentence."}]
}'
Port 8000 is reachable from the public internet because Docker publishes the container's port directly via iptables rules, bypassing UFW firewall rules. Your endpoint is live, but note that it is also open to anyone who finds the IP, since vLLM requires no API key by default; that's fine for a hackathon demo, not for production.
Calling it Like the OpenAI SDK
Because vLLM exposes an OpenAI-compatible API, you can use the standard openai Python client by pointing it at your droplet:
from openai import OpenAI
client = OpenAI(
base_url="http://<your-droplet-ip>:8000/v1",
api_key="not-required" # vLLM doesn't require auth by default
)
response = client.chat.completions.create(
model="Qwen/Qwen2.5-1.5B-Instruct",
messages=[{"role": "user", "content": "What can you do?"}]
)
print(response.choices[0].message.content)
Any code that works with OpenAI's gpt-4o works with your AMD-hosted model, just change the base_url and model name.
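One way to exploit that drop-in compatibility is to make the backend a config switch. A sketch, where LLM_BASE_URL and LLM_MODEL are hypothetical environment variable names invented for this example:

```python
import os
from dataclasses import dataclass

@dataclass
class LLMBackend:
    base_url: str
    model: str
    api_key: str

def backend_from_env() -> LLMBackend:
    """Pick the AMD droplet when LLM_BASE_URL is set, else fall back to OpenAI.

    LLM_BASE_URL / LLM_MODEL are hypothetical names for this sketch.
    """
    base_url = os.environ.get("LLM_BASE_URL")
    if base_url:  # e.g. http://<your-droplet-ip>:8000/v1
        return LLMBackend(
            base_url=base_url,
            model=os.environ.get("LLM_MODEL", "Qwen/Qwen2.5-1.5B-Instruct"),
            api_key="not-required",  # vLLM doesn't require auth by default
        )
    return LLMBackend(
        base_url="https://api.openai.com/v1",
        model="gpt-4o",
        api_key=os.environ.get("OPENAI_API_KEY", ""),
    )
```

Pass the resulting fields straight into `OpenAI(base_url=..., api_key=...)` and `client.chat.completions.create(model=...)`, and the same application runs against either backend.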
Testing With a Chat UI
If you'd rather test your endpoint through a proper chat interface instead of curl commands, you can use this single-file HTML chat template built specifically for vLLM and OpenAI-compatible endpoints: vllm-chat-template.
No build step, no dependencies. Clone it, open the file in a browser, and you're chatting with your AMD-hosted model.

Step 1. Clone the repo:
git clone https://github.com/Stephen-Kimoi/vllm-chat-template
cd vllm-chat-template
Step 2. Enable CORS on your vLLM server.
By default, the browser will block requests from a local HTML file to your remote server. Restart vLLM with CORS enabled:
pkill -f vllm
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-1.5B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
--allowed-origins '["*"]' \
> /tmp/vllm.log 2>&1 &
Step 3. Configure the template.
Open index.html and update the CONFIG block near the top of the <script> section:
const CONFIG = {
apiUrl: 'http://<your-droplet-ip>:8000/v1/chat/completions',
model: 'Qwen/Qwen2.5-1.5B-Instruct',
badge: 'vLLM',
headerTitle: 'Chat - AMD MI300X',
statusText: 'Connected',
footerInfo: 'vLLM Β· OpenAI-Compatible API',
specs: [
{ label: 'Endpoint', value: '<your-droplet-ip>:8000', accent: true },
{ label: 'Model', value: 'Qwen2.5-1.5B-Instruct' },
{ label: 'Hardware', value: 'AMD MI300X' },
],
};
Replace <your-droplet-ip> with your droplet's public IP, save the file, and open it directly in your browser (open index.html on Mac or double-click on Windows).
The UI shows a full chat interface on the left and the raw request/response JSON on the right, useful for seeing exactly what your endpoint is returning.
Phase 5: Loading Larger Models
Running a 7B Model
Once you've confirmed the basic setup works, try a larger model. The MI300X's 192 GB of VRAM means you have room to run models that would require multiple GPUs elsewhere:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
> /tmp/vllm.log 2>&1 &
At 7B parameters (~14 GB), this model loads in about a minute and produces significantly better outputs than the 1.5B variant.
Running Llama Models
For Llama 3.x models, you'll need to accept Meta's license on HuggingFace first, then supply your access token via the HF_TOKEN environment variable:
HF_TOKEN=<your-huggingface-token> python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-3B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
> /tmp/vllm.log 2>&1 &
Checking GPU Memory Usage
To see how much VRAM your model is using:
rocm-smi
This is the AMD equivalent of nvidia-smi. You'll see GPU utilization, memory usage, and temperature.
Phase 6: Cost Management
The Most Important Rule
Even when the droplet is powered off, you are still billed. To stop charges completely, you must Destroy the droplet from the dashboard, not just power it off.
Quick Reference: Stop and Restart
Stop vLLM (inside the container):
pkill -f vllm
Full restart sequence from your laptop:
# 1. SSH in
ssh root@<your-droplet-ip>
# 2. Enter the container
docker exec -it rocm /bin/bash
# 3. Start vLLM
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-1.5B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
> /tmp/vllm.log 2>&1 &
# 4. Wait for startup (faster on subsequent runs; weights are cached)
tail -f /tmp/vllm.log
Snapshots
If you want to save your environment and restore it later, take a snapshot before destroying:
AMD Developer Cloud > GPU Droplets > your droplet > Snapshots > Take Snapshot
Snapshots incur a small storage cost but let you restore the exact state of your droplet without re-running setup.
Cost Breakdown
| Resource | Rate | Notes |
|---|---|---|
| MI300X (1 GPU) | $1.99/hr | Covered by $100 credits |
| MI300X (8 GPU) | $15.92/hr | For large-scale training only |
| Bandwidth | Included | Standard transfer pool |
| Snapshot storage | ~$0.05/GB/month | Optional |
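Using the rates in this table, a small helper makes it easy to estimate what a session will cost. A sketch that hard-codes the rates above:

```python
# Hedged cost sketch using the rates from the table above.
def session_cost(hours: float, gpus: int = 1,
                 snapshot_gb: float = 0.0, snapshot_months: float = 0.0) -> float:
    rate = 1.99 if gpus == 1 else 15.92   # $/hr, single vs 8x MI300X
    storage = snapshot_gb * 0.05 * snapshot_months  # ~$0.05/GB/month
    return round(hours * rate + storage, 2)

# A 6-hour hackathon session on one GPU plus a 50 GB snapshot kept one month:
print(session_cost(6, snapshot_gb=50, snapshot_months=1))  # 14.44
```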
Summary: What You Built
| Component | Detail |
|---|---|
| Hardware | AMD Instinct MI300X, 192 GB HBM3 VRAM |
| Platform | AMD Developer Cloud (DigitalOcean) |
| Inference engine | vLLM (OpenAI-compatible) |
| Model | Qwen/Qwen2.5-1.5B-Instruct |
| API endpoint | POST /v1/chat/completions |
| Accessible from | Anywhere via public IP |
| Cost | $1.99/hr, ~50 hrs covered by free credits |
You now have a live LLM endpoint running on AMD GPU hardware, the same infrastructure that powers production AI at Meta, Microsoft, and Oracle. From here you can swap in any model from HuggingFace, connect it to a LangChain agent, a RAG pipeline, or any application that speaks the OpenAI API format.
Frequently Asked Questions
Can I use AMD Developer Cloud for an AI hackathon?
Yes. AMD Developer Cloud is well-suited for AI hackathons. For $1.99/hr (or free with the $100 credit grant from the AMD AI Developer Program), you get a dedicated MI300X instance with 192 GB of VRAM that can run any open-source model available today. The vLLM Quick Start image means you can have a live API endpoint in under 30 minutes, which matters when you're working against a hackathon deadline.
Do I need GPU or Linux experience to follow this tutorial?
No prior GPU experience is required. The tutorial uses a pre-configured vLLM Docker image that handles the entire ROCm setup (AMD's open-source equivalent of the CUDA stack) for you. Basic terminal comfort (SSH, running commands) is all you need.
What AI hackathon projects can I build once I have this endpoint running?
Once you have a live OpenAI-compatible endpoint, you can build anything that normally connects to a GPT model: AI agents with LangChain or CrewAI, RAG systems, domain-specific chatbots, multi-model pipelines, and agentic workflows. The AMD MI300X's 192 GB of VRAM also makes it practical for vision and multimodal models, opening up computer vision and document understanding use cases.
How long does it take to get the endpoint running?
Under 30 minutes from scratch. Droplet provisioning takes 2-4 minutes. The first model load (Qwen2.5-1.5B) takes about 30 seconds. After the first run, model weights are cached and subsequent starts take under 10 seconds.
Is AMD Developer Cloud free for AI hackathons?
The AMD AI Developer Program gives you $100 in credits (about 50 hours on a single MI300X) just for signing up. For hackathons specifically, AMD also provides bulk GPU credit grants to event organizers, so participants may receive additional credits depending on the event.
Resources
- AMD AI Developer Program (free credits)
- AMD Developer Cloud
- Getting Started on AMD Developer Cloud
- ROCm Documentation
- ROCm Linux Installation Guide
- ROCm on GitHub
- AMD Training Resources
Build your next AI project on AMD GPUs and join the AMD Developer Hackathon on lablab.ai.