Information on Chat Models
I wanted to create this page to compare the pricing and capabilities of different AI models, research how much it would cost to host an open-source model, and get a better sense of what large language models are out there.
Last Updated: 11/14/2024
Some Terminology
Tokens
Tokens are words, character sets, or combinations of words and punctuation that are used by large language models (LLMs) to break down text.
Tokens can be single characters like z or whole words like cat. Long words are broken up into several tokens. The set of all tokens used by the model is called the vocabulary, and the process of splitting text into tokens is called tokenization. The models below use tokens to calculate the cost of an API call, and they have different methods for counting them:
OpenAI
OpenAI Counting Tokens Reference
tiktoken is a fast open-source tokenizer by OpenAI. Given a text string and an encoding, a tokenizer can split the text string into a list of tokens. Knowing how many tokens are in a text string can tell you whether the string is too long for a text model to process and how much an OpenAI API call will cost. There are Python and JavaScript libraries that allow you to count OpenAI tokens (there are also libraries for other languages - see the link above).
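As a quick illustration, here is a minimal sketch of counting tokens with tiktoken in Python (this assumes the tiktoken package is installed; the encoding names reflect my understanding of which encodings the GPT-4o and older GPT-4/GPT-3.5 families use):

# pip install tiktoken
import tiktoken

def count_tokens(text: str, encoding_name: str = "o200k_base") -> int:
    """Return the number of tokens in `text` for the given tiktoken encoding.
    gpt-4o / gpt-4o-mini use "o200k_base"; older GPT-4 / GPT-3.5 models use "cl100k_base"."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

print(count_tokens("write a haiku about ai"))  # prints a small integer; exact count depends on the encoding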
Google Gemini
Gemini Counting Tokens Reference
Counting Tokens Code Example Gemini
For Gemini models, a token is equivalent to about 4 characters. 100 tokens is equal to about 60-80 English words. When billing is enabled, the cost of a call to the Gemini API is determined in part by the number of input and output tokens, so knowing how to count tokens can be helpful.
You can count tokens in the following ways:
- Call countTokens with the input of the request: this returns the total number of tokens in the input only. You can make this call before sending the input to the model to check the size of your requests.
- Use the usageMetadata attribute on the response object after calling generate_content: this returns the total number of tokens in both the input and the output (totalTokenCount). Both approaches are sketched below.
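Here is a minimal sketch of both approaches with the google-generativeai Python SDK (assuming an API key is configured; attribute names are taken from my reading of the docs linked above):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")
prompt = "The quick brown fox jumps over the lazy dog."

# 1) Count the input tokens before sending the request
print(model.count_tokens(prompt).total_tokens)

# 2) Inspect usage metadata after calling generate_content (input + output totals)
response = model.generate_content(prompt)
print(response.usage_metadata.total_token_count)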
Anthropic
Anthropic Token Counting
To access this feature, include the anthropic-beta:token-counting-2024-11-01 header in your API requests, or use client.beta.messages.count_tokens in your SDK calls.
Token counting enables you to determine the number of tokens in a message before sending it to Claude, helping you make informed decisions about your prompts and usage. How to count tokens with Anthropic:
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Valid token counting models: Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Haiku, Claude 3 Opus
const response = await client.beta.messages.countTokens({
  betas: ["token-counting-2024-11-01"],
  model: 'claude-3-5-sonnet-20241022',
  system: 'You are a scientist',
  messages: [{
    role: 'user',
    content: 'Hello, Claude'
  }]
});

console.log(response); // { "input_tokens": 14 }
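The equivalent call in the Python SDK looks roughly like this (a sketch based on the beta header and the client.beta.messages.count_tokens method mentioned above; check the Anthropic docs for the current signature):

import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.count_tokens(
    betas=["token-counting-2024-11-01"],
    model="claude-3-5-sonnet-20241022",
    system="You are a scientist",
    messages=[{"role": "user", "content": "Hello, Claude"}],
)
print(response.input_tokens)  # e.g. 14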
Llama API
I haven't seen a way to count tokens with the Llama API yet.
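One workaround (not from the Llama API docs) is to approximate counts locally with the model's Hugging Face tokenizer. This is a sketch that assumes the transformers package and access to Meta's gated checkpoints; the repo id below is an assumption:

from transformers import AutoTokenizer

# The official Meta checkpoints are gated on Hugging Face and require approval.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(len(tokenizer.encode("write a haiku about ai")))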
Abbreviations:
- LLM = Large Language Model
- RPM = Requests Per Minute
- TPM = Tokens Per Minute
- RPD = Requests Per Day
- Llama = Large Language Model Meta AI
Enterprise API Models
Looking into the differences between the Enterprise LLM providers. I want to see the cost breakdowns of each and what their model capabilities are.
OpenAI
Multiple models, each with different capabilities and price points. Prices can be viewed in units of either per 1M or 1K tokens. You can think of tokens as pieces of words, where 1,000 tokens is about 750 words.
Language models are also available in the Batch API, which returns completions within 24 hours for a 50% discount.
Learn more about OpenAI pricing on the OpenAI API pricing page.
Models:
- GPT-4o: Our high-intelligence flagship model for complex, multi-step tasks; it is faster and cheaper than GPT-4 Turbo and has vision capabilities. The model has 128k context and an October 2023 knowledge cutoff.
- GPT-4o mini: GPT-4o mini is our most cost-efficient small model that's smarter and cheaper than GPT-3.5 Turbo, and has vision capabilities. The model has 128k context and an October 2023 knowledge cutoff.
- OpenAI o1-preview: o1-preview is our new reasoning model for complex tasks. The model has 128k context and an October 2023 knowledge cutoff.
- OpenAI o1-mini: o1-mini is a fast, cost-efficient reasoning model tailored to coding, math, and science use cases. The model has 128k context and an October 2023 knowledge cutoff.
Model | Input Token Pricing (/ 1M input Tokens) | Batched Input Token Pricing (/ 1M input Tokens) | Cached Input Token Pricing (/ 1M cached input Tokens) | Output Token Pricing (/ 1M output Tokens) | Batched Output Token Pricing (/ 1M output Tokens) |
---|---|---|---|---|---|
gpt-4o-mini | $0.150 | $0.075 | $0.075 | $0.600 | $0.300 |
gpt-4o-mini-2024-07-18 | $0.150 | $0.075 | $0.075 | $0.600 | $0.300 |
o1-preview | $15.00 | N/A | $7.50 | $60.00 | N/A |
o1-preview-2024-09-12 | $15.00 | N/A | $7.50 | $60.00 | N/A |
o1-mini | $3.00 | N/A | $1.50 | $12.00 | N/A |
o1-mini-2024-09-12 | $3.00 | N/A | $1.50 | $12.00 | N/A |
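To make the pricing concrete, here is a rough sketch of estimating the cost of a single call from these per-1M-token rates (the prices are copied from the table above; the token counts are made-up example numbers):

# Prices in USD per 1M tokens, taken from the table above.
PRICES = {
    "gpt-4o-mini": {"input": 0.150, "output": 0.600},
    "o1-mini":     {"input": 3.00,  "output": 12.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a ~750-word prompt (~1,000 tokens) with a ~500-token reply on gpt-4o-mini
print(f"${estimate_cost('gpt-4o-mini', 1_000, 500):.6f}")  # ≈ $0.000450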
Google Gemini
The Google Gemini API is a powerful tool that provides access to Google DeepMind's Gemini models. These models are designed to be multimodal, meaning they can understand and process various types of data, including text, images, code, and audio.
Find out more about Google Gemini Pricing on its pricing page. List of current Gemini models available through the API:
- Gemini 1.5 Flash: Our fastest model with great performance for diverse, repetitive tasks and a 1 million token context window.
  - Rate Limits: 2,000 RPM; 4,000,000 TPM
  - Context Caching: $1.00 / 1,000,000 tokens per hour
  - Grounding with Google Search: $35 / 1,000 grounding requests (up to 5k grounding requests per day)
- Gemini 1.5 Flash-8B: Our smallest model for lower intelligence use cases with a 1 million token context window.
  - Rate Limits: 4,000 RPM; 4,000,000 TPM
  - Context Caching: $0.25 / 1,000,000 tokens per hour
  - Grounding with Google Search: $35 / 1,000 grounding requests (up to 5k grounding requests per day)
- Gemini 1.5 Pro: Our next generation model with a breakthrough 2 million token context window. Now generally available for production use.
  - Rate Limits: 1,000 RPM; 4,000,000 TPM
  - Context Caching: $4.50 / 1,000,000 tokens per hour
  - Grounding with Google Search: $35 / 1,000 grounding requests (up to 5k grounding requests per day)
- Gemini 1.0 Pro: Our first generation model offering only text and image reasoning. Generally available for production use.
  - Rate Limits: 360 RPM; 120,000 TPM; 30,000 RPD
  - Context Caching: N/A
  - Grounding with Google Search: $35 / 1,000 grounding requests (up to 5k grounding requests per day)
Model | Input Pricing (prompts up to 128k tokens) | Output Pricing (up to 128k) | Context Caching (up to 128k) | Input Pricing (prompts longer than 128k tokens) | Output Pricing (longer than 128k) | Context Caching (longer than 128k) |
---|---|---|---|---|---|---|
Gemini 1.5 Flash | $0.075 / 1 million tokens | $0.30 / 1 million tokens | $0.01875 / 1 million tokens | $0.15 / 1 million tokens | $0.60 / 1 million tokens | $0.0375 / 1 million tokens |
Gemini 1.5 Flash-8B | $0.0375 / 1 million tokens | $0.15 / 1 million tokens | $0.01 / 1 million tokens | $0.075 / 1 million tokens | $0.30 / 1 million tokens | $0.02 / 1 million tokens |
Gemini 1.5 Pro | $1.25 / 1 million tokens | $5.00 / 1 million tokens | $0.3125 / 1 million tokens | $2.50 / 1 million tokens | $10.00 / 1 million tokens | $0.625 / 1 million tokens |
Gemini 1.0 Pro | $0.50 / 1 million tokens | $1.50 / 1 million tokens | N/A | $0.50 / 1 million tokens | $1.50 / 1 million tokens | N/A |
Anthropic
Anthropic is an AI research and safety company founded in 2021 by former OpenAI researchers. Its mission is to create AI systems that are aligned with human values and are safe and reliable. They focus on developing large language models and other AI tools while emphasizing ethical guidelines and safety, aiming to prevent harmful consequences associated with advanced AI.
Models:
- Claude 3.5 Sonnet
  - Our most intelligent model to date
  - 200K context window
  - 50% discount with the Batches API
- Claude 3.5 Haiku
  - Fastest, most cost-effective model
  - 200K context window
  - 50% discount with the Batches API
- Claude 3 Opus
  - Powerful model for complex tasks
  - 200K context window
  - 50% discount with the Batches API
Model | Input (/ 1M Tokens) | Prompt Caching Write (/ 1M Tokens) | Prompt Caching Read (/ 1M Tokens) | Output (/ 1M Tokens) |
---|---|---|---|---|
Claude 3.5 Sonnet | $3 | $3.75 | $0.30 | $15 |
Claude 3.5 Haiku | $1 | $1.25 | $0.10 | $5 |
Claude 3 Opus | $15 | $18.75 | $1.50 | $75 |
Hosted Llama APIs
The Llama family of models are open-source models released by Meta AI starting in February 2023. Since these models are open source, you can download them and run them on your own server. Given that GPUs can be expensive, though, it can be cheaper to run the model through an external API service. The prices below apply to the Llama API service.

Title | Parameter Count | Price |
---|---|---|
Small | 0B-8B | $0.0004 / 1k Tokens |
Medium | 8B-30B | $0.0016 / 1k Tokens |
Large | >30B | $0.0028 / 1k Tokens |
Name | Author | Pricing (per 1K tokens) |
---|---|---|
llama3.2-11b-vision | meta | $0.0004 |
llama3.2-1b | meta | $0.0004 |
llama3.2-3b | meta | $0.0004 |
llama3.2-90b-vision | meta | $0.0028 |
llama3.1-405b | meta | $0.0036 |
llama3.1-70b | meta | $0.0028 |
llama3.1-8b | meta | $0.0004 |
llama3-70b | meta | $0.0028 |
llama3-8b | meta | $0.0004 |
gemma2-27b | google | $0.0016 |
gemma2-9b | google | $0.0004 |
mixtral-8x22b | mistral | $0.0028 |
mixtral-8x22b-instruct | mistral | $0.0028 |
mixtral-8x7b-instruct | mistral | $0.0028 |
mistral-7b | mistral | $0.0004 |
mistral-7b-instruct | mistral | $0.0004 |
llama-7b-32k | meta | $0.0028 |
llama2-13b | meta | $0.0016 |
llama2-70b | meta | $0.0028 |
llama2-7b | meta | $0.0016 |
Nous-Hermes-2-Mixtral-8x7B-DPO | mistral | $0.0004 |
Nous-Hermes-2-Yi-34B | custom | $0.0028 |
Qwen1.5-0.5B-Chat | custom | $0.0004 |
Qwen1.5-1.8B-Chat | custom | $0.0004 |
Qwen1.5-110B-Chat | custom | $0.0028 |
Qwen1.5-14B-Chat | custom | $0.0016 |
Qwen1.5-32B-Chat | custom | $0.0028 |
Qwen1.5-4B-Chat | custom | $0.0004 |
Qwen1.5-72B-Chat | custom | $0.0028 |
Qwen1.5-7B-Chat | custom | $0.0004 |
Qwen2-72B-Instruct | custom | $0.0028 |
How These APIs Work
In this section, I am going to take some notes on the documentation for the various API services and ideate on how to store messages in a database.
Documentation Links:
- OpenAI API Documentation
- Google Gemini Documentation
- Llama API Documentation
- Anthropic Documentation
OpenAI
There are three types of roles when doing chat completion (AI chat) through OpenAI's chat completion API (and other enterprise models):
- system
  - Messages with the system role act as top-level instructions to the model, and typically describe what the model is supposed to do and how it should generally behave and respond.
  - Example: "You are a helpful assistant that answers programming questions in the style of an ancient Chinese emperor."
- user
  - User messages contain instructions that request a particular type of output from the model. You can think of user messages as the messages you might type into ChatGPT as an end user.
- assistant
  - Messages with the assistant role are presumed to have been generated by the model, perhaps in a previous generation request. They can also be used to provide examples to the model for how it should respond to the current request, a technique known as few-shot learning.
JavaScript:
import OpenAI from "openai";

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { "role": "user", "content": "write a haiku about ai" }
  ]
});
Python:
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "write a haiku about ai"}
    ]
)
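Tying this back to the roles described above, here is a sketch of a Python request that uses all three roles, with a user/assistant pair acting as a few-shot example (the model name and messages are just illustrative):

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # system: top-level instructions for how the model should behave
        {"role": "system", "content": "You are a helpful assistant that answers programming questions in the style of an ancient Chinese emperor."},
        # user + assistant pair: a few-shot example of the desired style
        {"role": "user", "content": "How do I reverse a list in Python?"},
        {"role": "assistant", "content": "Hear me, humble scribe: my_list[::-1] shall return thy list reversed."},
        # the actual question for this request
        {"role": "user", "content": "How do I sort a dictionary by value?"},
    ],
)
print(completion.choices[0].message.content)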
Google Gemini
JavaScript:
const { GoogleGenerativeAI } = require("@google/generative-ai");

const genAI = new GoogleGenerativeAI("YOUR_API_KEY");
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

const prompt = "Explain how AI works";
const result = await model.generateContent(prompt);
console.log(result.response.text());
Python:
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Explain how AI works")
print(response.text)
The APIs for Llama and Anthropic are both very similar to the OpenAI and Google APIs; their documentation can be found through the links above.
How to Store Chats in a Database?
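As a starting point, here is a minimal sketch of one possible schema, assuming SQLite and the role/content message shape used by the chat APIs above; the table and column names are my own invention:

import sqlite3

conn = sqlite3.connect("chats.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS chats (
    id         INTEGER PRIMARY KEY,
    model      TEXT NOT NULL,            -- e.g. 'gpt-4o-mini', 'claude-3-5-sonnet-20241022'
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS messages (
    id         INTEGER PRIMARY KEY,
    chat_id    INTEGER NOT NULL REFERENCES chats(id),
    role       TEXT NOT NULL CHECK (role IN ('system', 'user', 'assistant')),
    content    TEXT NOT NULL,
    tokens     INTEGER,                  -- optional: token count for cost tracking
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
""")
conn.commit()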
Hosting Your Own Model
This section will basically go over some AWS prices. To run an LLM, you will need an EC2 instance with a GPU, so I am going to be looking into that. I will also be looking into the prices of additional RAM.
We recommend a GPU instance for most deep learning purposes. Training new models is faster on a GPU instance than a CPU instance. You can scale sub-linearly when you have multi-GPU instances or if you use distributed training across many instances with GPUs.
Instance Name | On-Demand Hourly Rate | vCPU | Memory | Storage | Network Performance | GPUs |
---|---|---|---|---|---|---|
p5e.48xlarge | $108.152 | 192 | 2048 GiB | 8 x 3840 GB SSD | 3200 Gigabit | up to 8 NVIDIA Tesla H200 GPUs |
p5.48xlarge | $98.32 | 192 | 2048 GiB | 8 x 3840 GB SSD | 3200 Gigabit | up to 8 NVIDIA Tesla H100 GPUs |
p4d.24xlarge | $32.7726 | 96 | 1152 GiB | 8 x 1000 GB SSD | 400 Gigabit | up to 8 NVIDIA Tesla A100 GPUs |
p3.2xlarge | $3.06 | 8 | 61 GiB | EBS Only | Up to 10 Gigabit | up to 8 NVIDIA Tesla V100 GPUs |
p3.8xlarge | $12.24 | 32 | 244 GiB | EBS Only | 10 Gigabit | up to 8 NVIDIA Tesla V100 GPUs |
p3.16xlarge | $24.48 | 64 | 488 GiB | EBS Only | 25 Gigabit | up to 8 NVIDIA Tesla V100 GPUs |
g3.4xlarge | $1.14 | 16 | 122 GiB | EBS Only | Up to 10 Gigabit | up to 4 NVIDIA Tesla M60 GPUs |
g3.8xlarge | $2.28 | 32 | 244 GiB | EBS Only | 10 Gigabit | up to 4 NVIDIA Tesla M60 GPUs |
g3.16xlarge | $4.56 | 64 | 488 GiB | EBS Only | 20 Gigabit | up to 4 NVIDIA Tesla M60 GPUs |
g3s.xlarge | $0.75 | 4 | 30.5 GiB | EBS Only | 10 Gigabit | up to 4 NVIDIA Tesla M60 GPUs |
g4ad.xlarge | $0.37853 | 4 | 16 GiB | 150 GB NVMe SSD | Up to 10 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
g4ad.2xlarge | $0.54117 | 8 | 32 GiB | 300 GB NVMe SSD | Up to 10 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
g4ad.4xlarge | $0.867 | 16 | 64 GiB | 600 GB NVMe SSD | Up to 10 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
g4ad.8xlarge | $1.734 | 32 | 128 GiB | 1200 GB NVMe SSD | 15 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
g4ad.16xlarge | $3.468 | 64 | 256 GiB | 2400 GB NVMe SSD | 25 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
g4dn.xlarge | $0.526 | 4 | 16 GiB | 125 GB NVMe SSD | Up to 25 Gigabit | up to 4 NVIDIA T4 GPUs |
g4dn.2xlarge | $0.752 | 8 | 32 GiB | 225 GB NVMe SSD | Up to 25 Gigabit | up to 4 NVIDIA T4 GPUs |
g4dn.4xlarge | $1.204 | 16 | 64 GiB | 225 GB NVMe SSD | Up to 25 Gigabit | up to 4 NVIDIA T4 GPUs |
g5.xlarge | $1.006 | 4 | 16 GiB | 1 x 250 GB NVMe SSD | Up to 10 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.2xlarge | $1.212 | 8 | 32 GiB | 1 x 450 GB NVMe SSD | Up to 10 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.4xlarge | $1.624 | 16 | 64 GiB | 1 x 600 GB NVMe SSD | Up to 25 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.8xlarge | $2.448 | 32 | 128 GiB | 1 x 900 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.12xlarge | $5.672 | 48 | 192 GiB | 1 x 3800 GB NVMe SSD | 40 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.16xlarge | $4.096 | 64 | 256 GiB | 1 x 1900 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.24xlarge | $8.144 | 96 | 384 GiB | 1 x 3800 GB NVMe SSD | 50 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.48xlarge | $16.288 | 192 | 768 GiB | 2 x 3800 GB NVMe SSD | 100 Gigabit | up to 8 NVIDIA A10G GPUs |
g6.xlarge | $0.8048 | 4 | 16 GiB | 1 x 250 GB NVMe SSD | Up to 10 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.2xlarge | $0.9776 | 8 | 32 GiB | 1 x 450 GB NVMe SSD | Up to 10 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.4xlarge | $1.3232 | 16 | 64 GiB | 1 x 600 GB NVMe SSD | Up to 25 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.8xlarge | $2.0144 | 32 | 128 GiB | 2 x 450 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.12xlarge | $4.6016 | 48 | 192 GiB | 4 x 940 GB NVMe SSD | 40 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.16xlarge | $3.3968 | 64 | 256 GiB | 2 x 940 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.24xlarge | $6.6752 | 96 | 384 GiB | 4 x 940 GB NVMe SSD | 50 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.48xlarge | $13.3504 | 192 | 768 GiB | 8 x 940 GB NVMe SSD | 100 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6e.xlarge | $1.861 | 4 | 32 GiB | 1 x 250 GB NVMe SSD | Up to 20 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.2xlarge | $2.24208 | 8 | 64 GiB | 1 x 450 GB NVMe SSD | Up to 20 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.4xlarge | $3.00424 | 16 | 128 GiB | 1 x 600 GB NVMe SSD | 20 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.8xlarge | $4.52856 | 32 | 256 GiB | 1 x 900 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.12xlarge | $10.49264 | 48 | 384 GiB | 2 x 1900 GB NVMe SSD | 100 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.16xlarge | $7.57719 | 64 | 512 GiB | 1 x 1900 GB NVMe SSD | 35 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.24xlarge | $15.06559 | 96 | 768 GiB | 2 x 1900 GB NVMe SSD | 200 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.48xlarge | $30.13118 | 192 | 1536 GiB | 4 x 1900 GB NVMe SSD | 400 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
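As a rough sanity check on these hourly rates, here is a small sketch that converts a few of the on-demand prices above into approximate monthly costs for an instance left running 24/7 (rates copied from the table; 730 hours per month is an approximation):

# On-demand hourly rates (USD) taken from the table above.
HOURLY = {
    "g4dn.xlarge":  0.526,    # 1x T4, 16 GiB RAM
    "g5.xlarge":    1.006,    # 1x A10G
    "p4d.24xlarge": 32.7726,  # 8x A100
}

HOURS_PER_MONTH = 730  # roughly 24 * 365 / 12

for name, rate in HOURLY.items():
    print(f"{name}: ${rate * HOURS_PER_MONTH:,.2f} / month")
# g4dn.xlarge: $383.98 / month
# g5.xlarge: $734.38 / month
# p4d.24xlarge: $23,924.00 / month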