
Setting Up a Gonka.ai H100 GPU Node: A Technical Journey with AI Assistance

Written by
Olena Tkhorovska
on January 13th, 2026

An honest account of running a decentralized AI compute node, featuring the surprising difference between ChatGPT and Claude Code. To learn why we are sharing the full technical journey, read the Preamble.

Working vs Earning

Day 2, hour 26 of running my Gonka.ai node. Everything looks perfect: synced to block 1,979,450, all services green, 76 successful validations... and exactly 0.00 GONKA in earnings. Then I check the logs: no validation activity for 14 hours. The GPU sits idle. The network is silent.

Is my node even participating?

This is the story of setting up a decentralized AI compute node on an H100 GPU, and discovering that "working" and "earning" are two very different things. More importantly, it's about how the AI tool you choose for infrastructure work can mean the difference between success and giving up entirely.

Why This Setup Was Different

I'm not a DevOps expert or blockchain specialist. I write code, sure, but managing infrastructure, debugging SSH issues, and configuring blockchain validators? That's not my daily work.

When I decided to run a Gonka.ai node on my H100 GPU, my first instinct was obvious: "I'll use ChatGPT to help guide me through this."

That lasted about 30 minutes before I hit the wall.

The ChatGPT Problem (And Why It Doesn't Work for Infrastructure)

Here's what the ChatGPT workflow looks like for complex infrastructure setup:

  1. You : "Help me set up a Gonka blockchain node"
  2. ChatGPT : Provides 200 lines of example code and commands
  3. You : Copy-paste the code into a file
  4. Terminal : Error: command not found
  5. You : "I get error: command not found"
  6. ChatGPT : "Oh, you need to install this first. Try: sudo apt install..."
  7. You : Copy-paste, new error appears
  8. You : "Now I get: Permission denied"
  9. ChatGPT : "That means you need to configure..."
  10. Repeat 15-20 times

The fundamental problem: ChatGPT can't see your files, run commands, or verify if anything actually worked. Every response is theoretical. It's giving advice to someone it can't see, about a system it can't access, with errors it can't verify.

For someone without deep Linux/Docker/blockchain expertise, this quickly becomes a game of copy-paste roulette. You're not learning—you're just hoping the next code block will magically work.

What I Actually Used: Claude Code

Instead, I used Claude Code (Anthropic's CLI coding assistant). Not the web chat interface—the actual command-line tool that runs on your machine.

Here's the difference:

Claude Code can:

✅ Read your actual files (knows what's already there)

✅ Edit files directly (no copy-paste)

✅ Run SSH commands on your server

✅ See error messages immediately

✅ Verify if something worked before moving on

✅ Maintain full context of your entire setup

ChatGPT cannot:

❌ Access your filesystem

❌ Execute commands

❌ See if its suggestions worked

❌ Fix errors in real-time

A Real Example: The Difference in Practice

With ChatGPT (hypothetical):

Me: "My Docker services won't start"

ChatGPT: "Check your .env file. It might have export statements. Try this format: [provides 50-line example file]"

Me: [Reads example, tries to figure out which parts to change]

Me: [Edits file, possibly wrong parts]

Me: [Runs docker compose up, gets different error]

Me: "Now I get: invalid syntax"

ChatGPT: "That could be several things. First check..."

[30 minutes later, still debugging]

With Claude Code (what actually happened):

Me: "My Docker services won't start"

Claude Code:

  • Reads your actual .env file
  • "Found the problem: line 3 has 'export' which Docker Compose doesn't support"
  • Edits the file automatically (removes 'export' statements)
  • Runs docker compose config to verify syntax
  • Runs docker compose up -d
  • Checks if services are running
  • "All services started successfully ✓"

Total time: 2 minutes

This isn't a theoretical difference. This is the actual experience I had, repeatedly, throughout the 8-hour setup process.

The Transparency Bit

Before we go further: This isn't sponsored content. I'm not affiliated with Anthropic. Other tools exist (Cursor, GitHub Copilot, Aider) that might work similarly—I just happened to use Claude Code because it was designed for CLI infrastructure work.

I'm sharing this because if you're a non-expert trying to run complex infrastructure, the tool you choose matters enormously. ChatGPT is phenomenal for learning concepts and brainstorming. But for actual infrastructure operations? You need something that can execute, verify, and iterate.

Alright, with that context: here's how I set up a Gonka.ai node earning rewards on the H100 GPU in my home lab.

What is Gonka.ai?

Quick context: Gonka is a decentralized AI inference network. Instead of running AI models on centralized cloud providers (OpenAI, Anthropic, Google), Gonka distributes inference work across a network of independent GPU nodes.

How you earn:

  • You run AI models (like Qwen3-32B) on your GPU
  • The network sends inference requests to your node
  • You process them and return results
  • You get paid in GONKA tokens for computational work
  • Bonus: You also validate other nodes' work (Proof of Compute 2.0)

My setup:

  • Hardware: NVIDIA H100 PCIe (81GB VRAM)
  • Goal: Monetize idle GPU time + contribute to decentralized AI
  • Reality check: This isn't passive income – it's active learning

Prerequisites & Planning

Before starting, I verified I had what I needed. The official docs (gonka.ai/host/quickstart) list requirements, but here's what actually mattered:

Hardware Requirements (What I Had)

# Checked my GPU

nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv

# Output:

# H100 PCIe, 81920 MiB, 580.95.05

GPU: H100 PCIe with 81GB VRAM 

Driver: 580.95.05 (minimum: 535+) 

CUDA: 13.0 (minimum: 12.6) 

Disk: 3.3TB available (you need 500GB minimum for blockchain + model) 

Network: Public IP with ports 22, 5000, 26657, 8000 accessible

Software Prerequisites

docker --version        # 24.0.7

docker compose version  # 2.23.0

Mental Prerequisites

The docs say: "Setup time: ~2 hours"

Actual time with Claude Code assistance: 8 hours (including debugging, learning, dashboard building)

Estimated time if I'd done it manually with ChatGPT: 16-20 hours, or I would've given up

Key mindset: You're going to hit problems. The difference is whether you have a tool that can actually help you solve them.

The Biggest Early Mistake (That I Avoided)

The official quickstart examples show configurations for 4-GPU setups running the massive Qwen3-235B model. If you blindly follow those examples with a single GPU, you'll spend hours wondering why nothing loads.

Spoiler: You need the single-GPU configuration with the smaller Qwen3-32B model. More on this in the "Model Configuration" section.

Phase 1: Local Setup (The Easy Part)

This went smoothly because Claude Code handled all the file management and validation.

1. Account Creation

Downloaded the Gonka CLI and created my account:

cd ~/gonka-setup

./bin/inferenced keys add my-account-key

Critical moment: The CLI generates a 24-word mnemonic phrase. This is your master password. Lose this = lose access forever. No recovery.

I saved it to:

  • Encrypted USB drive
  • Password manager (encrypted)
  • Physical paper in a safe

Claude Code helped here: Created the secure directory structure and reminded me to back up before proceeding.

2. ML Operations Key

For operational security, you need a separate key that the node uses (not your main account key):

./bin/inferenced keys add ml-ops-key

Same security applies—24-word phrase, store it safely.

3. Hugging Face Token

To download AI models, you need a Hugging Face token:

  1. Go to huggingface.co → Settings → Access Tokens
  2. Create read-only token
  3. Save to keys/huggingface-token.txt

4. Directory Structure

Claude Code set this up automatically:

gonka-setup/
├── bin/              # CLI tools
├── keys/             # 🔐 CRITICAL: All secrets here
│   ├── mnemonic.txt
│   ├── ml-ops-mnemonic.txt
│   ├── huggingface-token.txt
│   └── keyring-password.txt
├── configs/          # Configuration backups
└── logs/             # You'll need these for debugging

Lesson: Organization saves hours. When things break at hour 6, you'll be grateful for clean logs and backed-up configs.

Phase 2: Server Setup (Where Things Got Real)

Now we SSH into the actual H100 server to prepare the environment.

Firewall Configuration (Critical Security)

Several Gonka ports must be blocked from the internet to prevent exploitation. This was one area where Claude Code's execution capability was essential.

Allowed (public access):

sudo ufw allow 22/tcp      # SSH

sudo ufw allow 5000/tcp    # P2P networking

sudo ufw allow 26657/tcp   # Blockchain RPC

sudo ufw allow 8000/tcp    # Public API

Blocked (internal services only):

sudo ufw deny 9100/tcp     # Prometheus metrics

sudo ufw deny 9200/tcp     # Internal monitoring

sudo ufw deny 8080/tcp     # Internal proxy

sudo ufw deny 5050/tcp     # Inference endpoint

sudo ufw enable

Why this matters: I initially had all ports open (rookie mistake). Claude Code caught this during a security review and locked it down before going live.

The .env vs config.env Problem

This one would've stumped me for an hour without execution help.

The Gonka setup uses a config.env file with this format:

export KEY_NAME=my-key

export ACCOUNT_ADDRESS=gonka1...

Problem: Docker Compose doesn't understand export statements. You need a plain .env file:

KEY_NAME=my-key

ACCOUNT_ADDRESS=gonka1...

How ChatGPT would handle this:

  • You: "Docker Compose fails with syntax error"
  • ChatGPT: "Check your .env file format. Remove export statements."
  • You: Manually edit file, hope you got everything
  • You: Still getting errors, more back-and-forth

How Claude Code handled this:

# Automatically converted the file

sed 's/^export //' config.env > .env

# Validated syntax

docker compose config

# Confirmed: "Configuration valid ✓"

Time saved: ~20 minutes of trial-and-error

Phase 3: Model Configuration (The Hiccup)

Not every mistake needs to be dramatic—sometimes you just miss a line in the docs.

The Initial Config (Following Official Docs)

I cloned the Gonka repository and looked at node-config.json:

{
  "id": "h100-node1",
  "host": "inference",
  "models": {
    "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8": {
      "args": [
        "--tensor-parallel-size=4",
        "--max-model-len=40960"
      ]
    }
  }
}

My thinking:

  • "235B parameters = best quality = most earnings"
  • "The official example uses this, so it must work"
  • "Let's download the biggest model!"

The Reality Check

After 6 hours of setup, node synced perfectly, but:

nvidia-smi

# GPU Memory: 0 MiB / 81,559 MiB

# Utilization: 0%

Nothing loaded. ML node logs showed:

ERROR: The number of required GPUs exceeds the total number of available GPUs

Available GPUs: 1

Required GPUs: 4 (tensor-parallel-size: 4)

Waiting for 3 additional GPUs...

The math:

  • Qwen3-235B = 235 billion parameters
  • FP8 quantization ≈ 1 byte per parameter
  • Model size ≈ 235GB minimum
  • Add KV cache +40GB
  • Total needed : ~275GB
  • What I had : 81GB H100
  • Original config expectation : 4× H100 = 324GB ✅
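The back-of-envelope math above is easy to script before downloading anything. A minimal sketch, using the article's rough figures (1 byte per parameter for FP8, a flat ~40GB for KV cache, 81GB per H100):

```python
import math

def vram_needed_gb(params_billion: float, bytes_per_param: float = 1.0,
                   kv_cache_gb: float = 40.0) -> float:
    """Rough VRAM estimate: weights + KV cache (article's approximation)."""
    return params_billion * bytes_per_param + kv_cache_gb

def gpus_required(params_billion: float, gpu_vram_gb: float = 81.0) -> int:
    """Smallest GPU count whose combined VRAM covers the estimate."""
    return math.ceil(vram_needed_gb(params_billion) / gpu_vram_gb)

print(gpus_required(235))  # Qwen3-235B-FP8 → 4 GPUs
print(gpus_required(32))   # Qwen3-32B-FP8  → 1 GPU
```

Running this before Phase 3 would have flagged the mismatch in seconds instead of hours.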

The Fix: Single-GPU Configuration

The docs DO have a single-GPU configuration—I just missed it because the multi-GPU example is more prominent.

Correct config for single H100:

{
  "id": "h100-node1",
  "host": "inference",
  "models": {
    "Qwen/Qwen3-32B-FP8": {
      "args": []  // Empty! Let vLLM auto-configure
    }
  }
}

The fix process:

# Stop everything

docker compose down

# Delete configuration cache (CRITICAL!)

sudo rm -rf .dapi

# Update node-config.json with 32B model

# (Claude Code edited this directly)

# Download correct model (happens automatically on restart)

docker compose -f docker-compose.yml -f docker-compose.mlnode.yml up -d

# Wait 5 minutes for model download and loading...

# Success!

nvidia-smi

# GPU Memory: 73,223 MiB / 81,559 MiB ✅

# vLLM process running ✅

Key lessons:

  1. Bigger ≠ better (32B model is perfectly adequate)
  2. Always check if docs examples match YOUR hardware
  3. Configuration caching (.dapi directory) will bite you—delete it when changing configs
  4. Auto-configuration (args: []) is sometimes smarter than manual tuning

Time lost: 3 hours of debugging

Knowledge gained: Deep understanding of model quantization, tensor parallelism, and VRAM allocation

Phase 4: State Sync Issues (The Database That Wasn't Empty)

After fixing the model, I restarted everything. Node stuck at block height 0.

The error:

failed to restore snapshot

error="multistore restore: import failed:

found database at version 1962000, must be 0"

Initial Misunderstanding

My first thought: "Database is corrupted!"

Reality: Database wasn't empty. State sync requires a completely clean slate.

The Actual Problem

Previous failed attempts left data in:

  • .inference/ - Blockchain data
  • .dapi/ - API configuration cache
  • .tmkms/ - Key management state

The database version mismatch (had 1962000, needed 0) wasn't corruption—just residual state.

The Solution (Nuclear Option)

# Stop services

docker compose down

# Clean slate

sudo rm -rf .inference .dapi .tmkms

# Start fresh

docker compose -f docker-compose.yml -f docker-compose.mlnode.yml up -d

# Monitor sync progress

watch -n 5 'curl -s localhost:26657/status | grep latest_block_height'

The Sync Process

State sync stages:

  1. Download 504 snapshot chunks from peers
  2. Apply chunks to rebuild database
  3. IAVL tree upgrade (processes versioned tree structures)
  4. Catch up remaining blocks

Time: ~30 minutes to full sync (reached block 1,960,000+)
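The `watch`/`curl` one-liner above can also be written as a small standard-library poller that exits once sync completes. The field names follow the Tendermint/CometBFT `/status` RPC that the node exposes on port 26657; this is a sketch, not the dashboard's actual code:

```python
import json
import time
import urllib.request

def parse_sync_info(status_json: dict) -> tuple[int, bool]:
    """Extract block height and catching-up flag from a /status response.

    Tendermint-style RPC encodes the height as a string, e.g. "1962000".
    """
    info = status_json["result"]["sync_info"]
    return int(info["latest_block_height"]), info["catching_up"]

def wait_for_sync(rpc: str = "http://localhost:26657/status",
                  interval: int = 5) -> int:
    """Poll the node's RPC until state sync finishes, then return the height."""
    while True:
        with urllib.request.urlopen(rpc) as resp:
            height, catching_up = parse_sync_info(json.load(resp))
        print(f"height={height:,} catching_up={catching_up}")
        if not catching_up:
            return height
        time.sleep(interval)
```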

Lesson: When documentation says "fresh install," they mean FRESH. Don't try incremental debugging with old state. 30 minutes of re-sync beats 3 hours of debugging corrupted state.

Phase 5: Validation & Monitoring (Building a Dashboard with AI)

Proof of Life

First thing after sync: test if inference actually works.

curl -X POST http://localhost:5050/v1/chat/completions \

  -H "Content-Type: application/json" \

  -d '{

    "model": "Qwen/Qwen3-32B-FP8",

    "messages": [{"role": "user", "content": "Hello!"}],

    "max_tokens": 50

  }'

# Response time: ~2 seconds

# GPU spiked to 100% utilization ✅

# Got coherent AI response ✅

It's alive!

Validation Activity

ML node logs showed the node doing its job:

Stats: 76 validated, 1 fraud

fraud_detected=False

p_honest=1.000000

What this means:

  • Node validated 76 batches from other nodes
  • Detected 1 fraudulent result (catching cheaters!)
  • 100% honest in own computations
  • Contributing to network security ✅

The Monitoring Problem

After getting everything working, a new problem emerged: visibility.

The manual way (what I started with):

Checking node health required 5+ SSH commands:

# Is the node synced?

ssh user@server "curl -s localhost:26657/status | jq '.result.sync_info'"

# How's the GPU?

ssh user@server "nvidia-smi"

# Are services running?

ssh user@server "cd gonka/deploy/join && docker compose ps"

# What's my balance?

curl http://server:8000/v1/participants/gonka1547... | jq '.balance'

# Validation count?

ssh user@server "docker compose logs mlnode-308 | grep 'Stats:' | tail -1"

Time to check everything: ~5 minutes

How often I checked: every 30 minutes (paranoia about issues)

Daily time wasted: ~2.5 hours

This was unsustainable.

Building the Dashboard: ChatGPT vs Claude Code (The Showdown)

I needed a custom real-time monitoring dashboard. Requirements:

✅ Single command to run

✅ Real-time auto-refresh

✅ Beautiful terminal UI

✅ All metrics in one view

✅ Color-coded health status

✅ Historical GPU utilization tracking

Tech stack:

  • Python 3.10+
  • rich library (gorgeous terminal UI)
  • httpx (async HTTP requests)
  • uv (fast Python package manager)

Attempt 1: ChatGPT for Dashboard (The Failure)

Here's what actually happened when I tried ChatGPT first:

  1. Me : "Help me build a Python dashboard for monitoring a blockchain node"
  2. ChatGPT : Provides nice example code using the rich library
  3. Me : Copy-paste to dashboard.py, run it
  4. Error : ModuleNotFoundError: No module named 'rich'
  5. Me : "I get ModuleNotFoundError for rich"
  6. ChatGPT : "Install it with: pip install rich"
  7. Me : Tries pip, different error (virtual environment confusion, dependency conflicts)
  8. Me : "Now I get [dependency resolution error]"
  9. ChatGPT : "Try this instead..." suggests different approach
  10. 30 minutes later : Still debugging package installation

The core problem: ChatGPT can't see my environment, doesn't know I have uv installed, can't run commands to verify if anything works.

Attempt 2: Claude Code for Dashboard (The Success)

  1. Me : "Build a dashboard to monitor my Gonka node"
  2. Claude Code : Scaffolds the project, writes the code, installs dependencies with uv, runs the dashboard, and fixes whatever breaks
  3. Me : Runs the command it gives me
  4. Dashboard appears, working perfectly

Total time: 15 minutes for initial working version

The Key Differences (For Non-Technical Users)

[Table: Claude Code vs ChatGPT feature comparison]

Real Example: The Peer Count Bug

A few hours after the dashboard was running, I noticed something odd.

With ChatGPT (hypothetical):

Me: "Dashboard shows 1 peer but I expected more"

ChatGPT: "The issue might be in how you're querying the RPC endpoint.

Try using this code instead: [provides 300 lines of example code]"

Me: [Reads code, tries to figure out which part to change]

Me: [Edits wrong file, still shows 1]

ChatGPT: "Also check your network configuration..."

Me: [30 minutes of trial and error]

With Claude Code (what actually happened):

Me: "Dashboard shows 1 peer but I expected more"

Claude Code:

  • Reads fetchers.py (my actual file)
  • Identifies the bug: line 67 using voting_power instead of n_peers
  • "Found it. You're reading validator info instead of network info."
  • Edits fetchers.py with correct API endpoint (/net_info)
  • Runs curl to verify the endpoint returns correct data
  • Tests dashboard: "Now showing 21 peers ✓"

Total time: 3 minutes

This happened multiple times during development. Each fix: minutes instead of hours.
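The gist of that fix: peer count lives in the `/net_info` RPC endpoint, not in the validator fields of `/status`. A sketch of the corrected fetcher (the dashboard's real `fetchers.py` isn't shown here, so these names are illustrative; Tendermint-style RPC returns `n_peers` as a string):

```python
import json
import urllib.request

def parse_peer_count(net_info: dict) -> int:
    """Read the peer count from a /net_info response body."""
    # n_peers arrives as a string, e.g. "21"
    return int(net_info["result"]["n_peers"])

def fetch_peer_count(rpc_base: str = "http://localhost:26657") -> int:
    """Peer count comes from /net_info, not from validator voting_power."""
    with urllib.request.urlopen(f"{rpc_base}/net_info") as resp:
        return parse_peer_count(json.load(resp))
```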

Development Time Comparison

Traditional manual approach (estimated):

  • Day 1: Research Rich library, setup project (4-6 hours)
  • Day 2: Build data fetchers, SSH integration (6-8 hours)
  • Day 3: Create UI panels, layouts, colors (4-6 hours)
  • Day 4: Debug, test, refine (3-4 hours)
  • Total : 17-24 hours over 4 days

With Claude Code (actual):

  • Hour 1: Described requirements, reviewed generated plan
  • Hour 2: Implemented core fetchers and data collection
  • Hour 3: Built Rich UI panels with proper layouts
  • Hour 4: Added GPU utilization tracking and refinements
  • Total : ~4 hours in one evening

Time saved: 15-20 hours

How Claude Code Helped (Beyond Just Speed)

1. Instant Project Structure

  • Generated modular architecture (fetchers.py, display.py, config.py)
  • Proper separation of concerns
  • Best practices for Python package management with uv

2. SSH Integration Done Right

  • Secure command execution with proper timeouts
  • Error handling for network failures
  • Graceful degradation when data unavailable

3. Rich Library Expertise

  • Complex layouts (nested panels, columns, tables)
  • Color schemes based on health thresholds
  • Auto-refreshing Live display
  • Key point : I'd never used Rich before—would've taken hours to learn from docs

4. Real-Time Debugging

  • Fixed peer count display bug (showed 1 instead of 21)
  • Corrected service visibility (showed 4 instead of 8)
  • Added validation tracking on request
  • Each fix: minutes instead of hours

The Dashboard Architecture

# dashboard.py - Entry point

# - CLI argument parsing

# - Main loop with keyboard input handling (q to quit, r to refresh)

# fetchers.py - Data collection

# - fetch_node_status() → RPC endpoint for blockchain data

# - fetch_participant_info() → API for balance/rewards

# - fetch_docker_status() → SSH to check services

# - fetch_gpu_status() → nvidia-smi via SSH

# - fetch_validation_stats() → Parse ML node logs

# - fetch_today_utilization_stats() → Historical tracking

# display.py - Rich UI rendering

# - create_node_panel() → Blockchain sync status

# - create_participant_panel() → Account info

# - create_services_panel() → Docker containers (2-column grid)

# - create_gpu_panel() → GPU metrics + validations

# - create_system_panel() → Disk, uptime, model status

# - create_layout() → Combine all panels

# config.py - Configuration

# - Server SSH details

# - API endpoints

# - Refresh intervals

# - Health thresholds (temp, disk space)

# utilization_tracker.py - Historical data

# - Records GPU usage every 5 seconds

# - Calculates daily active time (>0% utilization)

# - Stores in local JSON file
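To make the utilization tracker concrete, here's one plausible implementation of the behavior described above: a sample counts as "active" when utilization is above 0%, samples arrive every 5 seconds, state lives in a local JSON file, and counters reset at midnight. The actual `utilization_tracker.py` may differ:

```python
import json
from datetime import date
from pathlib import Path

class UtilizationTracker:
    """Records GPU samples and reports active time for the current day."""

    def __init__(self, path: str = "utilization.json", interval: int = 5):
        self.path, self.interval = Path(path), interval
        # Fresh counters for today; reload saved state only if it's the same day
        self.state = {"day": str(date.today()),
                      "active_samples": 0, "total_samples": 0}
        if self.path.exists():
            saved = json.loads(self.path.read_text())
            if saved.get("day") == self.state["day"]:  # resets at midnight
                self.state = saved

    def record(self, gpu_util_percent: float) -> None:
        """Record one sample (called every `interval` seconds)."""
        self.state["total_samples"] += 1
        if gpu_util_percent > 0:
            self.state["active_samples"] += 1
        self.path.write_text(json.dumps(self.state))

    def active_seconds_today(self) -> int:
        """Approximate GPU-active time: active samples × sample interval."""
        return self.state["active_samples"] * self.interval
```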

Key Features Implemented

1. Real-Time Sync Status

  • Block height with comma formatting (1,979,450)
  • Catching up vs Synced indicator
  • Time since last block ("6s ago")

2. GPU Monitoring

  • Utilization percentage
  • Memory used/total (71.5GB / 79.6GB)
  • Temperature with color warnings:
  • Green: <70°C
  • Yellow: 70-80°C
  • Red: >80°C
  • Model loaded confirmation

3. Validation Tracking

  • Batches validated count
  • Fraud detected count
  • Last validation time with recency colors:
  • Green: seconds/minutes ago
  • Yellow: 1-2 hours ago
  • Red: >2 hours or never
  • Always visible even if zero

4. Service Health

  • All 8 services displayed (including stopped ones)
  • Status symbols (✓ green, ✗ red)
  • Running count (7/8 services)

5. Historical Analytics

  • GPU active time for current day
  • Percentage of day utilized (>0% threshold)
  • Stored locally, resets at midnight

The Result

uv run dashboard.py

gonka-node-dashboard

[Press 'q' to quit, 'r' to refresh] Refreshing in 5s...

Impact

Before Dashboard:

  • Health check time: 5 minutes (manual SSH commands)
  • Frequency: Every 30 minutes
  • Daily time spent: 2.5 hours
  • Visibility: Snapshots only

After Dashboard:

  • Health check time: 0 seconds (always visible)
  • Frequency: Continuous (auto-refresh every 5s)
  • Daily time saved: 2.5 hours
  • Visibility: Real-time + historical trends

Development time saved by Claude Code: ~15-20 hours

"Claude Code turned a 4-day dashboard project into a 4-hour sprint. The best part? It didn't just write code—it taught me the Rich library patterns while building. I learned by collaborating, not just copy-pasting."

The "Working vs Earning" Mystery

Dashboard running beautifully. All metrics green. Then I noticed something:

Current Status (After 26 hours operational):

  • Block Height : 1,979,450 (fully synced) ✅
  • Peers : 21 of 4,494 network participants ✅
  • GPU : Model loaded, ready for work ✅
  • Validations : 76 batches validated ✅
  • Balance : 0.00 GONKA ⏳
  • Last Validation : 14 hours ago 🚨

Everything looks perfect... except:

  1. Zero earnings after 26 hours
  2. No validation activity for 14+ hours
  3. GPU sitting idle despite being ready

Is the node working? Yes. Is it earning? Who knows.

This becomes important in understanding how the network actually operates...

Understanding the Gonka Ecosystem

With the dashboard revealing the "working but not earning" mystery, I needed to understand how the network actually functions.

Network Overview

Total Participants: 4,494 nodes

Active Earners: 1,867 nodes (41.5% of network)

My Position: Connected to 21 peers (0.47% of network)

Total Network Balance: 19.15 billion GONKA

Is 21 Peers Enough?

When I first saw "21 peers" I worried: "Shouldn't I have more connections to a 4,494-node network?"

Answer: No. Here's why 21 peers is actually perfect:

  • Information propagates : Your 21 peers connect to their peers, who connect to theirs
  • Hop count : Entire network reachable in 3-4 hops
  • Bandwidth efficiency : more peers would mostly mean redundant copies of the same messages
  • Redundancy : Even if 20 peers fail, you're still connected
  • Optimal range : 10-30 peers is ideal for blockchain P2P networks
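The hop-count claim checks out with a quick calculation. This is the idealized fan-out (real gossip overlaps heavily, so treat it as a lower bound on hops, not an exact model):

```python
import math

peers, network = 21, 4494
# Idealized fan-out: each hop multiplies reach by the peer count,
# so we need the smallest k with peers**k >= network.
hops = math.ceil(math.log(network) / math.log(peers))
print(hops)  # 3 — the whole network is reachable in a few hops
```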

The Incentive Model (Proof of Compute 2.0)

This isn't just about running inference. The economic model is more nuanced:

1. You Earn for Computational Work

  • Process inference requests
  • Validate other nodes' work
  • Detect fraudulent results

2. Epoch System

  • 15,552 blocks per epoch (~48 hours)
  • Rewards calculated and distributed at epoch end
  • Currently in epochs 0-180: Grace period (zero-cost inference for users)
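A quick sanity check on those epoch numbers, which implies a block time of roughly 11 seconds (consistent with the dashboard's "6s ago" readings being plausible):

```python
epoch_blocks = 15_552
epoch_hours = 48

# 48 hours of seconds spread across 15,552 blocks
block_seconds = epoch_hours * 3600 / epoch_blocks
print(round(block_seconds, 1))  # ≈ 11.1 seconds per block
```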

3. What Affects Your Earnings

  • GPU active time (more work = more rewards)
  • Successful validations
  • Network selection (probabilistic—you can't force it)
  • Node uptime and reliability

Why Zero Balance After 26 Hours?

Reason 1: Epoch Timing

  • I started during Epoch 126
  • Only operational for last 10.6 hours of that epoch
  • Rewards distribute at epoch end
  • First full epoch (127) still in progress

Reason 2: Wrong Model During Epoch 126

  • First 6 hours: Had 235B model configured (didn't load)
  • Last 4.6 hours: Had correct 32B model (actually working)
  • So very limited participation in Epoch 126

Expected first rewards: End of Epoch 127 (in ~19 hours from current time)

Why No Validation Activity for 14 Hours?

This one's still a mystery. Possibilities:

  1. Network Selection : Validation work assigned probabilistically
  2. Low Network Demand : Maybe just not many validation tasks during this period
  3. Timing : ML node restarted right at Epoch 127 start (coincidence?)
  4. Configuration : Possible issue preventing work assignment?

Status: Monitoring to see if activity resumes...

Lessons Learned

After 8 hours of setup, 26 hours of operation, and building a custom dashboard, here's what I wish I'd known from the start.

Technical Lessons

1. Use the Right Tool for Infrastructure Work

This is the #1 lesson.

  • Chat-based AI (ChatGPT, Claude.ai): Great for learning concepts, brainstorming
  • CLI-based AI (Claude Code): Essential for actual implementation
  • For non-technical users : Execution capability is the difference between success and frustration

Rule of thumb: If you're editing files and running commands, use a tool that can do both.

2. RTFM, But Verify

  • Official docs are a starting point
  • Single-GPU setup needs different config than examples
  • Always check if examples match YOUR hardware
  • The 4-GPU example is prominent; single-GPU config is buried

3. Model Selection is Critical

Calculate VRAM requirements FIRST:

Formula: (Parameters × bytes-per-param) + KV cache

Example:

  • Qwen3-235B: (235B × 1 byte) + 40GB = 275GB → Need 4 GPUs
  • Qwen3-32B: (32B × 1 byte) + 40GB = 72GB → Fits 1 GPU

Bigger model ≠ more earnings. Right-sized model = reliability.

4. Configuration Caching Will Bite You

Hidden state in .dapi, .inference directories means config changes get ignored if cache exists.

Solution: When changing major configs, delete cache directories and restart fresh.

Time saved: Nuclear option (delete all) takes 30 minutes. Debugging cached config takes 3 hours.

5. State Sync Requires Clean Slate

"Fresh install" means ZERO residual data. Don't try to debug with old state hanging around.

30 minutes of re-sync < 3 hours of debugging corrupted state.

6. Monitoring is Essential

  • Build (or have Claude Code build) a dashboard on day one
  • Log everything – you'll need it
  • Automated health checks save hours of manual SSH commands

Operational Lessons

1. Security First

  • Configure firewall rules BEFORE going live
  • Internal ports must be blocked (9100, 9200, 8080, 5050)
  • Regular security audits
  • Use separate operational keys (not your main account key)

2. Backup Everything Critical

  • Mnemonic phrase : Offline + encrypted (paper + USB + password manager)
  • Configuration files : Version controlled
  • Passwords : Secure password manager

What NOT to backup: Server data (blockchain will re-sync, no backup needed)

3. Time Estimation

  • Official estimate: 2 hours
  • With AI assistance: 8 hours
  • Without AI assistance: 16-20 hours or give up
  • Budget double the optimistic estimate

4. Community Resources

  • GitHub issues are gold for troubleshooting
  • Discord/community often faster than docs
  • Share your learnings (like this post!)

Economics Lessons

1. Rewards Take Time

  • Not instant gratification
  • First rewards: After completing a full epoch
  • ROI: Long-term perspective needed

2. Network Participation is Probabilistic

  • Uptime affects earnings
  • Validation work assigned randomly
  • More nodes = more competition
  • Can't force the network to send you work

3. Hardware Investment

  • H100 GPU: $25k-30k
  • Electricity: Non-trivial ongoing cost
  • Calculate break-even point before starting
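A break-even sketch for that calculation. The GPU price is from the range above; the 350W draw and $0.15/kWh rate are my assumptions (H100 PCIe board power and a typical residential tariff), and the monthly GONKA revenue is the big unknown, so treat the output as illustrative only:

```python
def break_even_months(gpu_cost_usd: float,
                      monthly_revenue_usd: float,
                      watts: float = 350.0,        # assumed H100 PCIe draw
                      usd_per_kwh: float = 0.15) -> float:
    """Months to recoup hardware cost, net of electricity (hypothetical inputs)."""
    monthly_power_cost = watts / 1000 * 24 * 30 * usd_per_kwh  # ≈ $37.80
    net = monthly_revenue_usd - monthly_power_cost
    if net <= 0:
        return float("inf")  # never breaks even at this revenue
    return gpu_cost_usd / net

# e.g. a $27,500 GPU earning a guessed $1,000/month in rewards:
print(round(break_even_months(27_500, 1_000), 1))
```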

What I'd Do Differently

Keep Doing:

  • Detailed logging from the start
  • Security-first approach
  • Building monitoring tools early
  • Using Claude Code for infrastructure work

Change:

  • Read single-GPU config examples first (before spending 3 hours on wrong model)
  • Delete all state between attempts earlier (save debugging time)
  • Set realistic time expectations (8 hours, not 2)
  • Test inference immediately after model loads (don't wait to discover it didn't load)

"The best debugging tool? A clean slate and fresh eyes. The second best? Detailed logs from when things worked. The third best? An AI assistant that can actually execute commands and verify results."

Current Status & Next Steps

Node Health (As of 26 Hours Operational)

✅ Blockchain: Synced to block 1,979,450

✅ Network: 21 healthy peers

✅ GPU: 71.5GB model loaded, ready

✅ Services: 7/8 running (bridge stopped, not critical)

✅ Validations: 76 batches completed

⏳ Rewards: First distribution in ~19 hours (end of Epoch 127)

What's Working

  • Model inference : 2-second response times ✅
  • Proof of Compute : Successfully validated 76 batches ✅
  • Network participation : Active in epoch 127 ✅
  • Monitoring : Real-time dashboard functional ✅

What's Not (Yet)

  • Bridge service : Stopped due to errors (not critical for ML operations)
  • Zero balance : Waiting for epoch completion
  • No recent validation work : Last 14+ hours quiet (investigating)

Immediate Next Steps

1. Monitor First Rewards (in ~19 hours)

  • Will validate that setup is actually earning
  • Baseline for future earnings projections
  • Proof of concept success

2. Investigate Validation Silence

  • Why no work assigned for 14 hours?
  • Network selection algorithm behavior?
  • Configuration issue?
  • Just probabilistic variance?

3. Bridge Service (lower priority)

  • Ethereum cross-chain functionality
  • Not critical for ML inference operations
  • Revisit after stable earnings confirmed

4. Optimization

  • Fine-tune max_model_len parameter
  • Monitor for optimal concurrent request handling
  • Balance performance vs. reliability

Long-Term Goals

  • Earnings Analysis : Track ROI over 30/60/90 days
  • Scaling : Consider multi-GPU setup if economics work out
  • Community : Share dashboard tool as open source
  • Documentation : Contribute single-GPU guide back to Gonka docs

The Bigger Picture

This isn't just about running a node – it's about:

  • Decentralization : Contributing to distributed AI infrastructure
  • Learning : Deep dive into blockchain + ML systems
  • Economics : Exploring crypto-incentivized compute markets
  • Tooling : Discovering how AI assistance changes what's possible for non-experts

Community: Building in public, sharing learnings

Conclusion

Was It Worth It?

The honest answer: Ask me in 90 days when I have earnings data.

What made it worth it already:

Technical Learning

  • Deep understanding of vLLM and model quantization
  • Practical blockchain infrastructure experience
  • GPU resource management at scale
  • Distributed systems debugging skills

The Right Tools Matter

  • Discovered the power of agentic AI for infrastructure work
  • Claude Code turned 8-hour debugging sessions into 15-minute fixes
  • For non-technical users : This is the key enabler
  • Traditional chat AI ≠ infrastructure automation AI

Problem-Solving Skills

  • Model selection crisis → systematic debugging
  • State sync issues → clean slate philosophy
  • Configuration mysteries → cache awareness
  • Dashboard development → AI-accelerated from days to hours

Community Contribution

  • Custom monitoring dashboard (will share as open source)
  • Documented single-GPU pitfalls
  • Real-world validation of official docs
  • Proof that non-experts can run complex infrastructure with the right assistance

Unanswered Questions

❓ Actual earning potential (waiting for data)

❓ Long-term stability and uptime requirements

❓ Network growth impact on individual earnings

❓ Future model updates and compatibility

Who Should Run a Gonka Node?

Good Fit (With Claude Code or similar AI assistant):

  • Basic command line comfort (not expertise required!)
  • Patient with troubleshooting
  • Interested in decentralized AI
  • Have GPU hardware already
  • Willing to learn alongside AI assistance
  • Note : You DON'T need to be a DevOps expert anymore

Good Fit (Without AI Assistance):

  • DevOps/SRE background required
  • Deep command line expertise
  • Infrastructure debugging experience
  • Blockchain familiarity helpful

Not Ready:

  • Expecting easy passive income
  • Zero willingness to troubleshoot
  • Expecting instant ROI
  • Not willing to use AI coding assistants (makes it MUCH harder)

Parting Wisdom

"Setting up a Gonka node isn't a 2-hour tutorial. It's an 8-hour debugging session that teaches you more about distributed AI infrastructure than any course could. But here's the thing: with Claude Code, those 8 hours are collaborative pair programming, not solo frustration. Come prepared to learn, not just to earn—and bring the right AI tools to the party."

Call to Action

  • Try it : Official quickstart at
  • Dashboard : I'll open-source the monitoring tool soon
  • Follow along : I'll post updates on earnings after 30/60/90 days
  • Connect : Share your own experiences in the comments

Final Stats

  • Setup time : 8 hours (with Claude Code)
  • Estimated time without AI : 16-20 hours (or I'd have given up entirely)
  • Major challenges faced : 3
  • Minor issues : 5+
  • Coffee consumed : Too much
  • Lessons learned : Invaluable
  • Would I do it again? : Absolutely
  • Would I do it without Claude Code? : Probably not

Appendices

Appendix A: Command Reference

Sanitized examples of key commands used:

# Prerequisites check
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
docker --version
docker compose version

# Firewall setup
sudo ufw allow 22/tcp      # SSH
sudo ufw allow 5000/tcp
sudo ufw allow 26657/tcp   # Tendermint/CometBFT RPC
sudo ufw allow 8000/tcp
sudo ufw deny 9100/tcp
sudo ufw deny 9200/tcp
sudo ufw enable

# Node setup
git clone https://github.com/gonka-ai/gonka.git
cd gonka/deploy/join
# Edit .env and node-config.json
docker compose -f docker-compose.yml -f docker-compose.mlnode.yml up -d

# Monitoring
curl -s localhost:26657/status | jq '.result.sync_info'
docker compose logs mlnode-308 --tail=50
nvidia-smi
docker compose ps

# Dashboard (if you build it)
cd ~/gonka-setup
uv run dashboard.py
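If you'd rather script the sync check than eyeball the curl output, a minimal parser for the /status response might look like this. The field names follow the standard CometBFT /status RPC schema; I'm assuming Gonka's RPC exposes the same shape:

```python
import json

def sync_summary(status_body: str) -> dict:
    """Pull the two fields a dashboard cares about out of the
    JSON returned by `curl -s localhost:26657/status`."""
    info = json.loads(status_body)["result"]["sync_info"]
    return {
        "height": int(info["latest_block_height"]),
        "catching_up": bool(info["catching_up"]),
    }

# Example with a trimmed-down response body:
sample = '{"result":{"sync_info":{"latest_block_height":"1979450","catching_up":false}}}'
print(sync_summary(sample))  # -> {'height': 1979450, 'catching_up': False}
```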

Appendix B: Troubleshooting Quick Reference

(Symptoms vs. solutions quick-reference table)

Appendix C: Resources & Links

Official:

  • Gonka Docs:
  • GitHub:
  • Model Hub:

Tools:

  • Claude Code:
  • NVIDIA Drivers:
  • Docker:
  • Python uv:

Learning:

  • vLLM Documentation:
  • Rich (Python TUI):
  • Tendermint/CometBFT:

Node Status Timeline

  Setup Date: January 1-2, 2026

  Initial Status: ✅ Operational & Synced (8/8 services)

  First Rewards: Pending (Epoch 127 completion)

  ---

  First Earnings - January 4, 2026 (Day 3)

  - Epoch 127 Completed: ✅ 11,289 GONKA earned (28 coins)

  - Status: First rewards successfully vested

  - Conversion Rate: ~403 GONKA per coin

  - Services: 7/8 running (bridge intentionally stopped)

---

  Interim Update - January 7, 2026 (Day 6)

  - Current Epoch: 131 (20% complete)

  - Epochs Completed: 127-130 (4 epochs)

  - Total Earnings: 79,946 GONKA

    - Vesting: 79,490 GONKA (unlocks over ~198 days)

    - Liquid Balance: 456 GONKA

  - Recent Performance:

    - E130: 10,880 coins → 32,681 GONKA (3.00 GONKA/coin)

    - E129: 3,700 coins → 19,520 GONKA (5.28 GONKA/coin)

    - E128: 1,300 coins → 16,000 GONKA (12.31 GONKA/coin)

  - Work Activity: 78 validations completed, 3 fraud detections

  - Network Position: 1,867 / 4,494 active participants

---

  Latest Update - January 10, 2026 (Day 9)

  - Current Epoch: 132 (36.5% complete)

  - Epochs Completed: 127-131 (5 epochs)

  - Total Earnings: 99,455 GONKA (+24% in 3 days)

    - Vesting: 98,555 GONKA

    - Liquid Balance: 900 GONKA (+97%)

  - Epoch 131 Performance: 129 coins → ~20,468 GONKA (158.67 GONKA/coin) 🚀

    - 53x better conversion rate than Epoch 130!

  - Work Activity: 482 total validations (+404 in 3 days)

  - Network Position: 3,403 / 4,933 active participants

  - Vesting Schedule: ~401 GONKA unlocking per day
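The per-epoch conversion rates quoted above follow directly from the reported coin and GONKA figures; a quick sanity check:

```python
# (coins earned, GONKA received) per epoch, as reported above.
epochs = {
    "E128": (1_300, 16_000),
    "E129": (3_700, 19_520),
    "E130": (10_880, 32_681),
    "E131": (129, 20_468),
}

rates = {e: round(gonka / coins, 2) for e, (coins, gonka) in epochs.items()}
print(rates)  # E128: 12.31, E129: 5.28, E130: 3.0, E131: 158.67

# Epoch 131 paid out roughly 53x more GONKA per coin than epoch 130.
print(round(rates["E131"] / rates["E130"]))  # -> 53
```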


Note: This blog post contains no sponsored content. All opinions are based on actual experience. Some details (IP addresses, account addresses) have been sanitized for security.

Olena Tkhorovska

CEO + Co-Founder