Three Architectures — Which One Do You Need?

TalkifAI agents come in three types:
Pipeline = voice agent: an assembly line of STT → LLM → TTS
Realtime = voice agent: one provider handles audio end-to-end
Text = chat-only agent: no audio, REST API + streaming text

Voice vs Text — The Key Rule

Voice agents (Pipeline / Realtime) can do BOTH voice calls AND text chat. Text agents can ONLY do text chat; they cannot handle voice calls.
Pipeline Agent  →  ✅ Voice calls  +  ✅ Text chat (Chat API)
Realtime Agent  →  ✅ Voice calls  +  ✅ Text chat (Chat API)
Text Agent      →  ❌ Voice calls  +  ✅ Text chat (Chat API) only
Capability               Pipeline  Realtime  Text
Inbound phone calls      ✅        ✅        ❌
Outbound calls           ✅        ✅        ❌
Browser voice test       ✅        ✅        ❌
Text Chat API            ✅        ✅        ✅
Website chat widget      ✅        ✅        ✅
Requires STT/TTS setup   ✅        ❌        ❌
Lowest cost              ❌        ❌        ✅
Which to choose? If you need voice calls, pick Pipeline or Realtime. If you only need a website chat widget or text API, pick Text — it’s simpler and cheaper.
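The voice-versus-text rule can be written down as a small lookup. The sketch below is illustrative only: `AgentType`, `Channel`, and `supports` are hypothetical names, not part of any TalkifAI SDK.

```python
from enum import Enum


class AgentType(Enum):
    PIPELINE = "pipeline"
    REALTIME = "realtime"
    TEXT = "text"


class Channel(Enum):
    VOICE_CALL = "voice_call"
    TEXT_CHAT = "text_chat"


# Capability matrix taken from the rule above (names are hypothetical).
SUPPORTED = {
    AgentType.PIPELINE: {Channel.VOICE_CALL, Channel.TEXT_CHAT},
    AgentType.REALTIME: {Channel.VOICE_CALL, Channel.TEXT_CHAT},
    AgentType.TEXT: {Channel.TEXT_CHAT},
}


def supports(agent_type: AgentType, channel: Channel) -> bool:
    """Return True if the given agent type can serve the given channel."""
    return channel in SUPPORTED[agent_type]
```

A check such as `supports(AgentType.TEXT, Channel.VOICE_CALL)` returns `False`, which mirrors the rule that Text agents cannot handle voice calls.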

Pipeline Architecture

How it works

User speaks  →  [Deepgram]    →  [GPT-4o]    →  [Cartesia]  →  User hears
                (Audio to         (Thinks and    (Converts
                 text)            responds)       to audio)
                  STT               LLM             TTS
Three separate, independently chosen services form a chain. You pick the best provider for each step.
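Conceptually, the chain is three functions composed in order. The sketch below uses stub stages in place of real providers; `make_pipeline` and the `fake_*` functions are hypothetical and only illustrate how independently chosen stages plug together.

```python
from typing import Callable

SttFn = Callable[[bytes], str]   # audio in, transcript out
LlmFn = Callable[[str], str]     # transcript in, reply text out
TtsFn = Callable[[str], bytes]   # reply text in, audio out


def make_pipeline(stt: SttFn, llm: LlmFn, tts: TtsFn) -> Callable[[bytes], bytes]:
    """Compose three independently chosen stages into one turn handler."""

    def handle_turn(audio_in: bytes) -> bytes:
        transcript = stt(audio_in)   # STT step
        reply = llm(transcript)      # LLM step
        return tts(reply)            # TTS step

    return handle_turn


# Stub stages standing in for real providers (assumptions for illustration).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


turn = make_pipeline(fake_stt, fake_llm, fake_tts)
```

Because each stage is just a callable, swapping one provider for another only changes the argument passed to `make_pipeline`; the other two stages are untouched.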

When to choose Pipeline

Mix Providers

Deepgram STT + Gemini LLM + Cartesia TTS — pick the best from each category.

Control Costs

Pair a cheaper STT with an affordable LLM to lower cost without giving up much quality.

Specific Voice

A particular ElevenLabs voice you love? Only possible with Pipeline.

First Time Building

Pipeline is more forgiving. Better starting point for new users.

Provider Options

STT Provider     Best For                     Accuracy
Deepgram Nova 2  Real-time, most languages    Excellent
OpenAI Whisper   High accuracy, non-English   Excellent
AssemblyAI       Accents, noisy environments  Very Good
Start with Deepgram Nova 2 — fastest and most accurate for the majority of use cases.

Realtime Architecture

How it works

User speaks ←→ [OpenAI Realtime / Gemini Live] ←→ User hears
               (One provider handles everything —
                no separate STT or TTS pipeline)
A single provider manages the entire audio loop. This brings response time down to roughly 200–400ms (versus 500–800ms for Pipeline) and makes the conversation feel more natural.

When to choose Realtime

Speed is Critical

Sub-300ms response time — the conversation feels completely natural.

Natural Interruptions

Users can interrupt the agent mid-sentence — just like a real conversation.

Already Using OpenAI/Google

If you already have an OpenAI or Google key, Realtime delivers the best value.

Premium Experience

The highest quality conversation — as close to human as current AI allows.

Realtime Providers

Provider         Model                    What Makes It Special
OpenAI Realtime  gpt-4o-realtime-preview  Best quality, full function calling support
Gemini Live      gemini-2.0-flash         Google’s latest, competitive pricing
Realtime requires your own API key (OpenAI or Google). TalkifAI platform keys are not supported here. Add your key under Settings → API Keys.
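The key requirement can be expressed as a simple pre-flight check. This is a sketch under assumptions: `validate_key_source` and the lowercase architecture strings are hypothetical and do not reflect how the platform validates keys internally.

```python
from typing import Optional


def validate_key_source(architecture: str, own_api_key: Optional[str]) -> None:
    """Raise if the architecture needs a user-supplied key and none is set.

    Hypothetical helper: Realtime needs your own OpenAI or Google key,
    while the other architectures may run without one.
    """
    if architecture == "realtime" and not own_api_key:
        raise ValueError(
            "Realtime requires your own OpenAI or Google API key "
            "(add it under Settings → API Keys)."
        )
```

Calling `validate_key_source("pipeline", None)` passes silently, while `validate_key_source("realtime", None)` raises `ValueError`.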

Text Architecture

How it works

User types  →  [LLM]  →  Response text  →  User reads
               (No audio — pure text in, text out)
No STT, no TTS — just a language model responding to text messages. Accessed via the Chat API using REST + Server-Sent Events (SSE streaming).
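On the client side, consuming an SSE stream mostly means collecting `data:` lines until a blank line ends the event. The parser below is a minimal sketch of the generic SSE wire format; it assumes nothing about the Chat API's endpoint or payload shape, which are not described here, and `iter_sse_data` is a hypothetical helper name.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event in a line stream."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            # Comment line, often used as a keep-alive; ignore it.
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # The format allows one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An unterminated event at end-of-stream is dropped, as the SSE format specifies.


chunks = list(iter_sse_data(["data: Hel\n", "\n", ": ping\n", "data: lo\n", "\n"]))
```

With the sample input above, `chunks` contains the two payloads in arrival order; a real client would feed this function the decoded lines of the HTTP response body.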

When to choose Text

Website Chat Widget

Embed a chatbot on any website. No microphone required — users type their messages.

Lowest Cost

No STT or TTS costs. You only pay for LLM tokens — the cheapest option.

Mobile App Chat

Build in-app chat experiences where voice isn’t appropriate (e.g., in a meeting).

API-First Integrations

Integrate AI into your own app, CRM, or support portal via REST API.

Text Agent Providers

Model             Provider  Best For
GPT-4o-mini       OpenAI    Most use cases: fast and affordable
GPT-4o            OpenAI    Complex reasoning, nuanced responses
Gemini 2.0 Flash  Google    Fast, cost-effective alternative
Text agents support the same features as voice agents, apart from anything audio-related: custom functions, knowledge base, subagents, memory (Graphiti), and post-call analysis.

Decision Guide — Which One to Pick?

Do you need voice calls (phone / browser)?
├── No → Use Text ✅ (cheapest, simplest)
└── Yes → Building your first agent?
          ├── Yes → Use Pipeline ✅
          └── No → How important is latency?
                   ├── Critical (sub-300ms) → Use Realtime ✅
                   └── Flexible (500–800ms ok) → Use Pipeline ✅
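The decision tree above translates directly into a short function. This is only an illustration of the branching order; `choose_architecture` and its parameter names are hypothetical.

```python
def choose_architecture(
    needs_voice: bool,
    first_agent: bool = False,
    latency_critical: bool = False,
) -> str:
    """Walk the decision guide and return 'text', 'pipeline', or 'realtime'."""
    if not needs_voice:
        return "text"        # cheapest, simplest
    if first_agent:
        return "pipeline"    # recommended starting point for new users
    # Experienced builder with voice needs: latency decides.
    return "realtime" if latency_critical else "pipeline"
```

For example, `choose_architecture(needs_voice=True, latency_critical=True)` returns `"realtime"`, while the same call with `first_agent=True` returns `"pipeline"` because the first-agent question is asked before the latency question.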

Side-by-Side Comparison

Feature                    🔧 Pipeline                 ⚡ Realtime
Response time              500–800ms                   200–400ms
Provider choice            Any combination             OpenAI or Google only
Voice variety              Any TTS voice available     Provider’s built-in voices
Cost control               Fine-grained per component  Single provider pricing
Function calling           ✅ Fully supported          ✅ Supported (OpenAI Realtime)
Interruption handling      Good                        Excellent
Own API key required       Optional                    Required
Recommended for beginners  ✅ Yes                      Not recommended

Switching Architecture Later

You can change architecture after creation — nothing is locked in:
  1. Open Studio → Agent Settings
  2. Select the new architecture type
  3. Configure the required fields
  4. Save and re-activate
Switching architecture resets your provider configuration. You will need to reconfigure all provider settings for the new architecture.
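The reset behaviour can be pictured as follows. This is a sketch of the described effect, not the platform's data model: `AgentConfig`, its fields, and `switch_architecture` are all hypothetical.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass
class AgentConfig:
    name: str
    architecture: str
    providers: Dict[str, str] = field(default_factory=dict)


def switch_architecture(config: AgentConfig, new_architecture: str) -> AgentConfig:
    """Return a copy on the new architecture with provider settings cleared."""
    if new_architecture == config.architecture:
        return config  # nothing changes, nothing is reset
    # Provider settings do not carry over and must be configured again.
    return replace(config, architecture=new_architecture, providers={})


before = AgentConfig(
    name="Support bot",
    architecture="pipeline",
    providers={"stt": "deepgram", "llm": "gpt-4o", "tts": "cartesia"},
)
after = switch_architecture(before, "realtime")
```

Here `after` keeps the agent name but has an empty `providers` mapping, which is why it is worth noting your current provider settings before switching.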

Next Steps

Create an Agent

Now that you understand architectures, build your agent.

Choose a Voice

TTS provider and voice selection guide.