AMD Developer Cloud Tutorial: Host Your First LLM on AMD GPU for AI Hackathons

Introduction
If you've been building AI applications entirely on managed API endpoints, this tutorial is your entry point into running models on raw GPU hardware: your own endpoint, your own model, your own infrastructure.
AMD Developer Cloud gives you on-demand access to the AMD Instinct MI300X: a GPU with 192 GB of VRAM, the same hardware Meta uses to serve all of its live Llama 3.1 405B traffic in production. You don't need physical hardware, a data center, or the NVIDIA tax. For $1.99/hour (or free with credits), you get a machine that can load almost any open-source model available today.
In this tutorial you'll go from zero to a live, publicly accessible AI API endpoint in under 30 minutes:
- Sign up for the AMD AI Developer Program and claim your free credits
- Create a GPU Droplet pre-loaded with vLLM
- SSH in, launch a model, and hit the API from your laptop
This setup is particularly practical for AI hackathons, where you need a real inference endpoint running fast without spending days on infrastructure. Having your own AMD-hosted model gives you full control over latency, model selection, and cost, which is a real advantage under hackathon time constraints. If you're looking for upcoming AI hackathons to put this into practice, lablab.ai runs global events year-round.
Participate in the AMD Developer Hackathon on lablab.ai and put this stack to work.
Why AMD Developer Cloud?
Before diving in, here's why this setup makes sense for developers building AI applications:
| Feature | AMD MI300X | NVIDIA H100 |
|---|---|---|
| VRAM | 192 GB | 80 GB |
| Memory bandwidth | 5.3 TB/s | 3.35 TB/s |
| Price (cloud) | $1.99/hr | ~$4-6/hr |
| Open-source stack | ROCm (fully open) | CUDA (proprietary) |
The MI300X is the only GPU that can serve Llama 3.1 405B in float16 on a single 8-GPU node, with no spillover across machines. For inference serving, vLLM on MI300X achieves up to 1.5x higher throughput and 1.7x faster time-to-first-token according to AMD's published benchmarks. For a hackathon or a prototype, it's the highest-value GPU you can access today.
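The VRAM numbers above translate directly into which models fit where. As a back-of-the-envelope sketch (a rough rule of thumb, not an official sizing tool): float16 weights take about 2 bytes per parameter, ignoring KV cache and activation overhead.

```python
# Hypothetical helper: rough float16 weight footprint for a model.
# Rule of thumb: ~2 bytes per parameter, ignoring KV cache and activations.
def fp16_weight_gb(params_billion: float) -> float:
    return params_billion * 1e9 * 2 / 1e9  # decimal GB

for name, b in [("Qwen2.5-1.5B", 1.5), ("Qwen2.5-7B", 7.0), ("Llama 3.1 405B", 405.0)]:
    print(f"{name}: ~{fp16_weight_gb(b):.0f} GB of weights")
```

At roughly 810 GB of weights, Llama 3.1 405B overflows a single 192 GB GPU but fits inside one 8x MI300X node (1,536 GB total), whereas an 8x 80 GB node does not have the headroom in float16.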
Prerequisites
- A computer with a terminal (Mac, Linux, or Windows with WSL)
- An SSH key pair (you'll generate one if you don't have one)
- A credit card on file (required to unlock GPU access; charges are covered by free credits)
Phase 1: Getting Access and Credits
Joining the AMD AI Developer Program
The fastest way to get GPU credits is through the AMD AI Developer Program. It gives you $100 in credits (~50 hours on a single MI300X) just for signing up; there's no approval process.
- Go to AMD AI Developer Program
- Create a free account
- Credits are applied automatically to your AMD Developer Cloud account
As a bonus, the program also includes a private Discord with AMD engineers and a one-month DeepLearning.AI pro membership.
Accessing AMD Developer Cloud
AMD Developer Cloud is powered by DigitalOcean. Once you've joined the AMD AI Developer Program, sign up at AMD Developer Cloud.
Sign in with your AMD Developer account. You'll land on a standard DigitalOcean-style dashboard, but this instance gives you access to AMD Instinct GPU hardware that isn't available on the regular DigitalOcean.
Adding a Payment Method
Before you can create a GPU Droplet, the platform requires a valid payment method on file, even if your $100 credits will cover everything. Without a card, the "Create GPU Droplet" button stays grayed out.
Go to Billing in the left sidebar and add a card. Your credits will be consumed first; the card is a safety net for overage.
Cost reference: At $1.99/hr for a single MI300X, $100 gives you approximately 50 hours of GPU time. Credits typically expire 30 days after being applied; check your expiry date under Billing.
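The cost math is simple enough to sanity-check yourself:

```python
# Quick arithmetic behind the "$100 ~ 50 hours" figure.
CREDITS_USD = 100.00
RATE_PER_HOUR = 1.99  # single MI300X droplet

hours = CREDITS_USD / RATE_PER_HOUR
print(f"{hours:.1f} GPU hours")  # prints "50.3 GPU hours"
```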
Phase 2: Creating the GPU Droplet
Choosing Your Configuration
Navigate to GPU Droplets in the sidebar and click Create GPU Droplet.
You'll be presented with several choices. Here's what to select and why:
Region
Choose ATL1 (Atlanta) as it consistently has MI300X availability. If you see capacity issues, try the other available region.
GPU Plan: MI300X (1 GPU)
| Spec | Value |
|---|---|
| GPU | 1x AMD Instinct MI300X |
| VRAM | 192 GB HBM3 |
| vCPU | 20 |
| RAM | 240 GB |
| Boot disk | 720 GB NVMe |
| Scratch disk | 5 TB |
| Cost | $1.99/hr |
The single MI300X is the right choice for running and serving a model. The 8x MI300X option ($15.92/hr) is for large distributed training jobs or serving 70B+ parameter models at scale; it's overkill for a first deployment.
Image: vLLM Quick Start Package
When selecting the image, you'll see several options:
| Image | What it is |
|---|---|
| vLLM (recommended) | Pre-configured LLM inference engine with OpenAI-compatible API |
| ROCm Software | Bare AMD GPU environment, requires manual setup |
| SGLang | Alternative inference framework |
| PyTorch | Raw PyTorch, for training not serving |
| Megatron | Large-scale distributed training only |
Select the vLLM Quick Start image. This gives you a pre-built Docker container with vLLM, ROCm, and all dependencies already installed and configured. Zero setup time.
SSH Key
GPU Droplets require SSH key authentication; password login is disabled. If you already have a key pair, add your public key here. If not, generate one:
ssh-keygen -t ed25519 -C "[email protected]"
Your public key is at ~/.ssh/id_ed25519.pub. Copy its contents and paste it into the SSH key field.
Creating the Droplet
Click Create GPU Droplet. The droplet takes 2-4 minutes to provision. You'll see it move from Creating to Active in the dashboard. Once active, copy the Public IP address as you'll need it for the next step.
Phase 3: Connecting and Launching a Model
SSH Into the Droplet
Once the droplet is active, connect to it from your terminal:
ssh root@<your-droplet-ip>
Replace <your-droplet-ip> with the public IP from the dashboard. If your SSH key is in a non-default location, specify it with -i:
ssh -i ~/.ssh/id_ed25519 root@<your-droplet-ip>
You'll land in a root shell on Ubuntu 24.04.
Entering the vLLM Container
The vLLM Quick Start image runs a pre-built Docker container named rocm. Enter it:
docker exec -it rocm /bin/bash
You're now inside the container where vLLM is installed and the AMD ROCm stack is ready.
Launching Your First Model
Run the following command to start the vLLM API server with Qwen2.5-1.5B-Instruct:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-1.5B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
> /tmp/vllm.log 2>&1 &
Why Qwen2.5-1.5B?
- Downloads freely from HuggingFace with no authentication (unlike Llama, which requires accepting Meta's license)
- At 1.5B parameters (~3 GB), it loads in seconds on the MI300X's 192 GB of VRAM
- Supports the OpenAI chat format (/v1/chat/completions) out of the box
- Produces quality responses for demos and prototypes
Why --dtype float16?
The model's default dtype is bfloat16, but float16 has broader compatibility with AMD ROCm at this stage. vLLM will warn you about the cast; float16 is the correct choice here.
Why run it in the background with &?
Running in the foreground locks your terminal. The background operator (&) plus the log redirect (> /tmp/vllm.log 2>&1) lets you monitor startup progress while keeping your shell free.
Monitoring Startup
Watch the logs until the server is ready:
tail -f /tmp/vllm.log
You'll see the model weights download from HuggingFace, load into GPU memory, and ROCm graph compilation complete. The whole process takes about 25-30 seconds on the first run. Wait for this line:
INFO: Application startup complete.
Then press Ctrl+C to stop tailing the log. Your server is live.
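If you'd rather script the wait than watch the log, you can poll the server's /v1/models endpoint until it answers. This is a minimal sketch, not part of the vLLM tooling; the probe parameter exists only so the loop can be exercised without a live server.

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(url: str = "http://localhost:8000/v1/models",
                     timeout_s: float = 120.0,
                     probe=None) -> bool:
    """Poll the vLLM /v1/models endpoint until it answers 200 or we time out.

    `probe` is injectable for testing; by default it performs a real HTTP GET.
    """
    def default_probe(u: str) -> bool:
        try:
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(2)
    return False
```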
Phase 4: Testing the API
Test From Inside the Droplet
With the server running, run a quick test from inside the container:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [{"role": "user", "content": "Hello"}]
}'
You'll get a response like:
{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"choices": [{
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 30,
"completion_tokens": 10,
"total_tokens": 40
}
}
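In application code you'll usually want just the reply text and token counts. A minimal parsing sketch, run against the sample response shown above:

```python
# Minimal sketch: pull the assistant's reply and token usage out of a
# /v1/chat/completions response (same shape as the sample above).
def parse_chat_response(payload: dict) -> tuple[str, int]:
    content = payload["choices"][0]["message"]["content"]
    total_tokens = payload["usage"]["total_tokens"]
    return content, total_tokens

sample = {
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "choices": [{
        "message": {"role": "assistant",
                    "content": "Hello! How can I assist you today?"},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 30, "completion_tokens": 10, "total_tokens": 40},
}

text, tokens = parse_chat_response(sample)
print(text, tokens)
```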
Test From Your Laptop
Open a new terminal on your local machine and call the public IP directly:
curl -s http://<your-droplet-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [{"role": "user", "content": "Explain ROCm in one sentence."}]
}'
Port 8000 is reachable from the public internet because Docker publishes the container's port directly via iptables rules, bypassing UFW firewall rules. Your endpoint is live, but note that it is also open to anyone who finds the IP, since vLLM requires no API key by default; that's fine for a hackathon demo, not for production.
Calling it Like the OpenAI SDK
Because vLLM exposes an OpenAI-compatible API, you can use the standard openai Python client by pointing it at your droplet:
from openai import OpenAI
client = OpenAI(
base_url="http://<your-droplet-ip>:8000/v1",
api_key="not-required" # vLLM doesn't require auth by default
)
response = client.chat.completions.create(
model="Qwen/Qwen2.5-1.5B-Instruct",
messages=[{"role": "user", "content": "What can you do?"}]
)
print(response.choices[0].message.content)
Any code that works with OpenAI's gpt-4o works with your AMD-hosted model, just change the base_url and model name.
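One way to exploit that drop-in compatibility is to make the backend a config switch. A sketch, where LLM_BASE_URL and LLM_MODEL are hypothetical environment variable names invented for this example:

```python
import os
from dataclasses import dataclass

@dataclass
class LLMBackend:
    base_url: str
    model: str
    api_key: str

def backend_from_env() -> LLMBackend:
    """Pick the AMD droplet when LLM_BASE_URL is set, else fall back to OpenAI.

    LLM_BASE_URL / LLM_MODEL are hypothetical names for this sketch.
    """
    base_url = os.environ.get("LLM_BASE_URL")
    if base_url:  # e.g. http://<your-droplet-ip>:8000/v1
        return LLMBackend(
            base_url=base_url,
            model=os.environ.get("LLM_MODEL", "Qwen/Qwen2.5-1.5B-Instruct"),
            api_key="not-required",  # vLLM doesn't require auth by default
        )
    return LLMBackend(
        base_url="https://api.openai.com/v1",
        model="gpt-4o",
        api_key=os.environ.get("OPENAI_API_KEY", ""),
    )
```

Pass the resulting fields straight into `OpenAI(base_url=..., api_key=...)` and `client.chat.completions.create(model=...)`, and the same application runs against either backend.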
Testing With a Chat UI
If you'd rather test your endpoint through a proper chat interface instead of curl commands, you can use this single-file HTML chat template built specifically for vLLM and OpenAI-compatible endpoints: vllm-chat-template.
No build step, no dependencies. Clone it, open the file in a browser, and you're chatting with your AMD-hosted model.

Step 1. Clone the repo:
git clone https://github.com/Stephen-Kimoi/vllm-chat-template
cd vllm-chat-template
Step 2. Enable CORS on your vLLM server.
By default, the browser will block requests from a local HTML file to your remote server. Restart vLLM with CORS enabled:
pkill -f vllm
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-1.5B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
--allowed-origins '["*"]' \
> /tmp/vllm.log 2>&1 &
Step 3. Configure the template.
Open index.html and update the CONFIG block near the top of the <script> section:
const CONFIG = {
apiUrl: 'http://<your-droplet-ip>:8000/v1/chat/completions',
model: 'Qwen/Qwen2.5-1.5B-Instruct',
badge: 'vLLM',
headerTitle: 'Chat - AMD MI300X',
statusText: 'Connected',
footerInfo: 'vLLM Β· OpenAI-Compatible API',
specs: [
{ label: 'Endpoint', value: '<your-droplet-ip>:8000', accent: true },
{ label: 'Model', value: 'Qwen2.5-1.5B-Instruct' },
{ label: 'Hardware', value: 'AMD MI300X' },
],
};
Replace <your-droplet-ip> with your droplet's public IP, save the file, and open it directly in your browser (open index.html on Mac or double-click on Windows).
The UI shows a full chat interface on the left and the raw request/response JSON on the right, useful for seeing exactly what your endpoint is returning.
Phase 5: Loading Larger Models
Running a 7B Model
Once you've confirmed the basic setup works, try a larger model. The MI300X's 192 GB of VRAM means you have room to run models that would require multiple GPUs elsewhere:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
> /tmp/vllm.log 2>&1 &
At 7B parameters (~14 GB), this model loads in about a minute and produces significantly better outputs than the 1.5B variant.
Running Llama Models
For Llama 3.x models, you'll need to accept Meta's license on HuggingFace first, then supply your access token via the HF_TOKEN environment variable:
HF_TOKEN=<your-huggingface-token> python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-3B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
> /tmp/vllm.log 2>&1 &
Checking GPU Memory Usage
To see how much VRAM your model is using:
rocm-smi
This is the AMD equivalent of nvidia-smi. You'll see GPU utilization, memory usage, and temperature.
Phase 6: Cost Management
The Most Important Rule
Even when the droplet is powered off, you are still billed. To stop charges completely, you must Destroy the droplet from the dashboard, not just power it off.
Quick Reference: Stop and Restart
Stop vLLM (inside the container):
pkill -f vllm
Full restart sequence from your laptop:
# 1. SSH in
ssh root@<your-droplet-ip>
# 2. Enter the container
docker exec -it rocm /bin/bash
# 3. Start vLLM
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-1.5B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
> /tmp/vllm.log 2>&1 &
# 4. Wait for startup (faster on subsequent runs; weights are cached)
tail -f /tmp/vllm.log
Snapshots
If you want to save your environment and restore it later, take a snapshot before destroying:
AMD Developer Cloud > GPU Droplets > your droplet > Snapshots > Take Snapshot
Snapshots incur a small storage cost but let you restore the exact state of your droplet without re-running setup.
Cost Breakdown
| Resource | Rate | Notes |
|---|---|---|
| MI300X (1 GPU) | $1.99/hr | Covered by $100 credits |
| MI300X (8 GPU) | $15.92/hr | For large-scale training only |
| Bandwidth | Included | Standard transfer pool |
| Snapshot storage | ~$0.05/GB/month | Optional |
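Using the rates in this table, a small helper makes it easy to estimate what a session will cost. A sketch that hard-codes the rates above:

```python
# Hedged cost sketch using the rates from the table above.
def session_cost(hours: float, gpus: int = 1,
                 snapshot_gb: float = 0.0, snapshot_months: float = 0.0) -> float:
    rate = 1.99 if gpus == 1 else 15.92   # $/hr, single vs 8x MI300X
    storage = snapshot_gb * 0.05 * snapshot_months  # ~$0.05/GB/month
    return round(hours * rate + storage, 2)

# A 6-hour hackathon session on one GPU plus a 50 GB snapshot kept one month:
print(session_cost(6, snapshot_gb=50, snapshot_months=1))  # 14.44
```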
Summary: What You Built
| Component | Detail |
|---|---|
| Hardware | AMD Instinct MI300X, 192 GB HBM3 VRAM |
| Platform | AMD Developer Cloud (DigitalOcean) |
| Inference engine | vLLM (OpenAI-compatible) |
| Model | Qwen/Qwen2.5-1.5B-Instruct |
| API endpoint | POST /v1/chat/completions |
| Accessible from | Anywhere via public IP |
| Cost | $1.99/hr, ~50 hrs covered by free credits |
You now have a live LLM endpoint running on AMD GPU hardware, the same infrastructure that powers production AI at Meta, Microsoft, and Oracle. From here you can swap in any model from HuggingFace, connect it to a LangChain agent, a RAG pipeline, or any application that speaks the OpenAI API format.
Frequently Asked Questions
Can I use AMD Developer Cloud for an AI hackathon?
Yes. AMD Developer Cloud is well-suited for AI hackathons. For $1.99/hr (or free with the $100 credit grant from the AMD AI Developer Program), you get a dedicated MI300X instance with 192 GB of VRAM that can run any open-source model available today. The vLLM Quick Start image means you can have a live API endpoint in under 30 minutes, which matters when you're working against a hackathon deadline.
Do I need GPU or Linux experience to follow this tutorial?
No prior GPU experience is required. The tutorial uses a pre-configured vLLM Docker image that handles the entire ROCm setup (AMD's open-source equivalent of the CUDA stack) for you. Basic terminal comfort (SSH, running commands) is all you need.
What AI hackathon projects can I build once I have this endpoint running?
Once you have a live OpenAI-compatible endpoint, you can build anything that normally connects to a GPT model: AI agents with LangChain or CrewAI, RAG systems, domain-specific chatbots, multi-model pipelines, and agentic workflows. The AMD MI300X's 192 GB of VRAM also makes it practical for vision and multimodal models, opening up computer vision and document understanding use cases.
How long does it take to get the endpoint running?
Under 30 minutes from scratch. Droplet provisioning takes 2-4 minutes. The first model load (Qwen2.5-1.5B) takes about 30 seconds. After the first run, model weights are cached and subsequent starts take under 10 seconds.
Is AMD Developer Cloud free for AI hackathons?
The AMD AI Developer Program gives you $100 in credits (about 50 hours on a single MI300X) just for signing up. For hackathons specifically, AMD also provides bulk GPU credit grants to event organizers, so participants may receive additional credits depending on the event.
Resources
- AMD AI Developer Program (free credits)
- AMD Developer Cloud
- Getting Started on AMD Developer Cloud
- ROCm Documentation
- ROCm Linux Installation Guide
- ROCm on GitHub
- AMD Training Resources
Build your next AI project on AMD GPUs and join the AMD Developer Hackathon on lablab.ai.