Information on Chat Models
I wanted to create this page to compare the pricing and capabilities of different AI models, research how much it would cost to host an open-source model, and get a better sense of what large language models are out there.
Last Updated: 11/14/2024
Some Terminology
Tokens
Tokens are words, character sets, or combinations of words and punctuation that are used by large language models (LLMs) to break down text.
Tokens can be single characters like z or whole words like cat. Long words are broken up into several tokens. The set of all tokens used by the model is called the vocabulary, and the process of splitting text into tokens is called tokenization. The models below use tokens to calculate the cost of an API call, and they have different methods for counting them:
OpenAI
OpenAI Counting Tokens Reference
tiktoken is a fast open-source tokenizer by OpenAI. Given a text string and an encoding, a tokenizer can split the text string into a list of tokens. Knowing how many tokens are in a text string can tell you whether the string is too long for a text model to process and how much an OpenAI API call will cost. There are Python and JavaScript libraries that allow you to count OpenAI tokens (there are also libraries for other languages - see the link above).
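As a quick illustration, here is a minimal sketch of counting tokens with tiktoken in Python (this assumes the tiktoken package is installed; the encoding names reflect my understanding of which encodings the GPT-4o and older GPT-4/GPT-3.5 families use):

# pip install tiktoken
import tiktoken

def count_tokens(text: str, encoding_name: str = "o200k_base") -> int:
    """Return the number of tokens in `text` for the given tiktoken encoding.
    gpt-4o / gpt-4o-mini use "o200k_base"; older GPT-4 / GPT-3.5 models use "cl100k_base"."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

print(count_tokens("write a haiku about ai"))  # prints a small integer; exact count depends on the encoding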
Google Gemini
Gemini Counting Tokens Reference
Counting Tokens Code Example Gemini
For Gemini models, a token is equivalent to about 4 characters. 100 tokens is equal to about 60-80 English words. When billing is enabled, the cost of a call to the Gemini API is determined in part by the number of input and output tokens, so knowing how to count tokens can be helpful.
You can count tokens in the following ways:
- Call countTokens with the input of the request: this returns the total number of tokens in the input only. You can make this call before sending the input to the model to check the size of your requests.
- Use the usageMetadata attribute on the response object after calling generate_content: this returns the total number of tokens in both the input and the output (totalTokenCount). Both approaches are sketched below.
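Here is a minimal sketch of both approaches with the google-generativeai Python SDK (assuming an API key is configured; attribute names are taken from my reading of the docs linked above):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")
prompt = "The quick brown fox jumps over the lazy dog."

# 1) Count the input tokens before sending the request
print(model.count_tokens(prompt).total_tokens)

# 2) Inspect usage metadata after calling generate_content (input + output totals)
response = model.generate_content(prompt)
print(response.usage_metadata.total_token_count)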
Anthropic
Anthropic Token Counting
To access this feature, include the anthropic-beta:token-counting-2024-11-01 header in your API requests, or use client.beta.messages.count_tokens in your SDK calls.
Token counting enables you to determine the number of tokens in a message before sending it to Claude, helping you make informed decisions about your prompts and usage. How to count tokens with Anthropic:
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Valid token counting models: Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Haiku, Claude 3 Opus
const response = await client.beta.messages.countTokens({
  betas: ["token-counting-2024-11-01"],
  model: 'claude-3-5-sonnet-20241022',
  system: 'You are a scientist',
  messages: [{
    role: 'user',
    content: 'Hello, Claude'
  }]
});

console.log(response); // { "input_tokens": 14 }
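The equivalent call in the Python SDK looks roughly like this (a sketch based on the beta header and the client.beta.messages.count_tokens method mentioned above; check the Anthropic docs for the current signature):

import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.count_tokens(
    betas=["token-counting-2024-11-01"],
    model="claude-3-5-sonnet-20241022",
    system="You are a scientist",
    messages=[{"role": "user", "content": "Hello, Claude"}],
)
print(response.input_tokens)  # e.g. 14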
Llama API
I haven't seen a way to count tokens with the Llama API yet.
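One workaround (not from the Llama API docs) is to approximate counts locally with the model's Hugging Face tokenizer. This is a sketch that assumes the transformers package and access to Meta's gated checkpoints; the repo id below is an assumption:

from transformers import AutoTokenizer

# The official Meta checkpoints are gated on Hugging Face and require approval.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(len(tokenizer.encode("write a haiku about ai")))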
Abbreviations:
- LLM = Large Language Model
- RPM = Requests Per Minute
- TPM = Tokens Per Minute
- RPD = Requests Per Day
- Llama = Large Language Model Meta AI
Enterprise API Models
Looking into the differences between the Enterprise LLM providers. I want to see the cost breakdowns of each and what their model capabilities are.
OpenAI
Multiple models, each with different capabilities and price points. Prices can be viewed in units of either per 1M or 1K tokens. You can think of tokens as pieces of words, where 1,000 tokens is about 750 words.
Language models are also available in the Batch API, which returns completions within 24 hours for a 50% discount.
Learn more about OpenAI pricing on the OpenAI API pricing page.
Models:
- GPT-4o: Our high-intelligence flagship model for complex, multi-step tasks; it is faster and cheaper than GPT-4 Turbo and has vision capabilities. The model has 128k context and an October 2023 knowledge cutoff.
- GPT-4o mini: GPT-4o mini is our most cost-efficient small model that's smarter and cheaper than GPT-3.5 Turbo, and has vision capabilities. The model has 128k context and an October 2023 knowledge cutoff.
- OpenAI o1-preview: o1-preview is our new reasoning model for complex tasks. The model has 128k context and an October 2023 knowledge cutoff.
- OpenAI o1-mini: o1-mini is a fast, cost-efficient reasoning model tailored to coding, math, and science use cases. The model has 128k context and an October 2023 knowledge cutoff.
Model | Input Token Pricing (/ 1M input Tokens) | Batched Input Token Pricing (/ 1M input Tokens) | Cached Input Token Pricing (/ 1M cached input Tokens) | Output Token Pricing (/ 1M output Tokens) | Batched Output Token Pricing (/ 1M output Tokens) |
---|---|---|---|---|---|
gpt-4o-mini | $0.150 | $0.075 | $0.075 | $0.600 | $0.300 |
gpt-4o-mini-2024-07-18 | $0.150 | $0.075 | $0.075 | $0.600 | $0.300 |
o1-preview | $15.00 | N/A | $7.50 | $60.00 | N/A |
o1-preview-2024-09-12 | $15.00 | N/A | $7.50 | $60.00 | N/A |
o1-mini | $3.00 | N/A | $1.50 | $12.00 | N/A |
o1-mini-2024-09-12 | $3.00 | N/A | $1.50 | $12.00 | N/A |
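To make the pricing concrete, here is a rough sketch of estimating the cost of a single call from these per-1M-token rates (the prices are copied from the table above; the token counts are made-up example numbers):

# Prices in USD per 1M tokens, taken from the table above.
PRICES = {
    "gpt-4o-mini": {"input": 0.150, "output": 0.600},
    "o1-mini":     {"input": 3.00,  "output": 12.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a ~750-word prompt (~1,000 tokens) with a ~500-token reply on gpt-4o-mini
print(f"${estimate_cost('gpt-4o-mini', 1_000, 500):.6f}")  # ≈ $0.000450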
Google Gemini
The Google Gemini API is a powerful tool that provides access to Google DeepMind's Gemini models. These models are designed to be multimodal, meaning they can understand and process various types of data, including text, images, code, and audio.
Find out more about Google Gemini Pricing on its pricing page. List of current Gemini models available through the API:
- Gemini 1.5 Flash: Our fastest model with great performance for diverse, repetitive tasks and a 1 million token context window.
  - Rate Limits: 2,000 RPM; 4,000,000 TPM
  - Context Caching: $1.00 / 1,000,000 tokens per hour
  - Grounding with Google Search: $35 / 1,000 grounding requests (up to 5k grounding requests per day)
- Gemini 1.5 Flash-8B: Our smallest model for lower intelligence use cases with a 1 million token context window.
  - Rate Limits: 4,000 RPM; 4,000,000 TPM
  - Context Caching: $0.25 / 1,000,000 tokens per hour
  - Grounding with Google Search: $35 / 1,000 grounding requests (up to 5k grounding requests per day)
- Gemini 1.5 Pro: Our next generation model with a breakthrough 2 million token context window. Now generally available for production use.
  - Rate Limits: 1,000 RPM; 4,000,000 TPM
  - Context Caching: $4.50 / 1,000,000 tokens per hour
  - Grounding with Google Search: $35 / 1,000 grounding requests (up to 5k grounding requests per day)
- Gemini 1.0 Pro: Our first generation model offering only text and image reasoning. Generally available for production use.
  - Rate Limits: 360 RPM; 120,000 TPM; 30,000 RPD
  - Context Caching: N/A
  - Grounding with Google Search: $35 / 1,000 grounding requests (up to 5k grounding requests per day)
Model | Input Pricing (prompts up to 128k tokens) | Output Pricing (up to 128k) | Context Caching (up to 128k) | Input Pricing (prompts longer than 128k tokens) | Output Pricing (longer than 128k) | Context Caching (longer than 128k) |
---|---|---|---|---|---|---|
Gemini 1.5 Flash | $0.075 / 1 million tokens | $0.30 / 1 million tokens | $0.01875 / 1 million tokens | $0.15 / 1 million tokens | $0.60 / 1 million tokens | $0.0375 / 1 million tokens |
Gemini 1.5 Flash-8B | $0.0375 / 1 million tokens | $0.15 / 1 million tokens | $0.01 / 1 million tokens | $0.075 / 1 million tokens | $0.30 / 1 million tokens | $0.02 / 1 million tokens |
Gemini 1.5 Pro | $1.25 / 1 million tokens | $5.00 / 1 million tokens | $0.3125 / 1 million tokens | $2.50 / 1 million tokens | $10.00 / 1 million tokens | $0.625 / 1 million tokens |
Gemini 1.0 Pro | $0.50 / 1 million tokens | $1.50 / 1 million tokens | N/A | $0.50 / 1 million tokens | $1.50 / 1 million tokens | N/A |
Anthropic
Anthropic is an AI research and safety company founded in 2021 by former OpenAI researchers. Its mission is to create AI systems that are aligned with human values and are safe and reliable. They focus on developing large language models and other AI tools while emphasizing ethical guidelines and safety, aiming to prevent harmful consequences associated with advanced AI.
Models:
- Claude 3.5 Sonnet
  - Our most intelligent model to date
  - 200K context window
  - 50% discount with the Batches API
- Claude 3.5 Haiku
  - Fastest, most cost-effective model
  - 200K context window
  - 50% discount with the Batches API
- Claude 3 Opus
  - Powerful model for complex tasks
  - 200K context window
  - 50% discount with the Batches API
Model | Input (/ 1M Tokens) | Prompt Caching Write (/ 1M Tokens) | Prompt Caching Read (/ 1M Tokens) | Output (/ 1M Tokens) |
---|---|---|---|---|
Claude 3.5 Sonnet | $3 | $3.75 | $0.30 | $15 |
Claude 3.5 Haiku | $1 | $1.25 | $0.10 | $5 |
Claude 3 Opus | $15 | $18.75 | $1.50 | $75 |
Hosted Llama APIs
The Llama family of models are open-source models released by Meta AI starting in February 2023. Since these models are open source, you can download them and run them on your own server. Given that GPUs can be expensive, though, it can be cheaper to run the model through an external API service. The prices below apply to the Llama API service.

Title | Parameter Count | Price |
---|---|---|
Small | 0B-8B | $0.0004 / 1k Tokens |
Medium | 8B-30B | $0.0016 / 1k Tokens |
Large | >30B | $0.0028 / 1k Tokens |
Name | Author | Pricing (per 1K tokens) |
---|---|---|
llama3.2-11b-vision | meta | $0.0004 |
llama3.2-1b | meta | $0.0004 |
llama3.2-3b | meta | $0.0004 |
llama3.2-90b-vision | meta | $0.0028 |
llama3.1-405b | meta | $0.0036 |
llama3.1-70b | meta | $0.0028 |
llama3.1-8b | meta | $0.0004 |
llama3-70b | meta | $0.0028 |
llama3-8b | meta | $0.0004 |
gemma2-27b | google | $0.0016 |
gemma2-9b | google | $0.0004 |
mixtral-8x22b | mistral | $0.0028 |
mixtral-8x22b-instruct | mistral | $0.0028 |
mixtral-8x7b-instruct | mistral | $0.0028 |
mistral-7b | mistral | $0.0004 |
mistral-7b-instruct | mistral | $0.0004 |
llama-7b-32k | meta | $0.0028 |
llama2-13b | meta | $0.0016 |
llama2-70b | meta | $0.0028 |
llama2-7b | meta | $0.0016 |
Nous-Hermes-2-Mixtral-8x7B-DPO | mistral | $0.0004 |
Nous-Hermes-2-Yi-34B | custom | $0.0028 |
Qwen1.5-0.5B-Chat | custom | $0.0004 |
Qwen1.5-1.8B-Chat | custom | $0.0004 |
Qwen1.5-110B-Chat | custom | $0.0028 |
Qwen1.5-14B-Chat | custom | $0.0016 |
Qwen1.5-32B-Chat | custom | $0.0028 |
Qwen1.5-4B-Chat | custom | $0.0004 |
Qwen1.5-72B-Chat | custom | $0.0028 |
Qwen1.5-7B-Chat | custom | $0.0004 |
Qwen2-72B-Instruct | custom | $0.0028 |
How These APIs Work
In this section, I am going to take some notes on the documentation for the various API services and ideate on how to store messages in a database.
Documentation Links:
- OpenAI API Documentation
- Google Gemini Documentation
- Llama API Documentation
- Anthropic Documentation
OpenAI
There are three types of roles when doing chat completion (AI chat) through OpenAI's chat completion API (and other enterprise models):
- system
  - Messages with the system role act as top-level instructions to the model, and typically describe what the model is supposed to do and how it should generally behave and respond.
  - Example: "You are a helpful assistant that answers programming questions in the style of an ancient Chinese emperor."
- user
  - User messages contain instructions that request a particular type of output from the model. You can think of user messages as the messages you might type into ChatGPT as an end user.
- assistant
  - Messages with the assistant role are presumed to have been generated by the model, perhaps in a previous generation request. They can also be used to provide examples to the model for how it should respond to the current request, a technique known as few-shot learning.
JavaScript:
import OpenAI from "openai";

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { "role": "user", "content": "write a haiku about ai" }
  ]
});
Python:
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "write a haiku about ai"}
    ]
)
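Tying this back to the roles described above, here is a sketch of a Python request that uses all three roles, with a user/assistant pair acting as a few-shot example (the model name and messages are just illustrative):

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # system: top-level instructions for how the model should behave
        {"role": "system", "content": "You are a helpful assistant that answers programming questions in the style of an ancient Chinese emperor."},
        # user + assistant pair: a few-shot example of the desired style
        {"role": "user", "content": "How do I reverse a list in Python?"},
        {"role": "assistant", "content": "Hear me, humble scribe: my_list[::-1] shall return thy list reversed."},
        # the actual question for this request
        {"role": "user", "content": "How do I sort a dictionary by value?"},
    ],
)
print(completion.choices[0].message.content)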
Google Gemini
JavaScript:
const { GoogleGenerativeAI } = require("@google/generative-ai");

const genAI = new GoogleGenerativeAI("YOUR_API_KEY");
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

const prompt = "Explain how AI works";
const result = await model.generateContent(prompt);
console.log(result.response.text());
Python:
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Explain how AI works")
print(response.text)
The APIs for Llama and Anthropic are both very similar to the OpenAI and Google APIs; their documentation can be found through the links above.
How to Store Chats in a Database?
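As a starting point, here is a minimal sketch of one possible schema, assuming SQLite and the role/content message shape used by the chat APIs above; the table and column names are my own invention:

import sqlite3

conn = sqlite3.connect("chats.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS chats (
    id         INTEGER PRIMARY KEY,
    model      TEXT NOT NULL,            -- e.g. 'gpt-4o-mini', 'claude-3-5-sonnet-20241022'
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS messages (
    id         INTEGER PRIMARY KEY,
    chat_id    INTEGER NOT NULL REFERENCES chats(id),
    role       TEXT NOT NULL CHECK (role IN ('system', 'user', 'assistant')),
    content    TEXT NOT NULL,
    tokens     INTEGER,                  -- optional: token count for cost tracking
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
""")
conn.commit()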
Hosting Your Own Model
This section will basically go over some AWS prices. To run an LLM, you will need an EC2 instance with a GPU, so I am going to be looking into that. I will also be looking into the prices of additional RAM.
We recommend a GPU instance for most deep learning purposes. Training new models is faster on a GPU instance than a CPU instance. You can scale sub-linearly when you have multi-GPU instances or if you use distributed training across many instances with GPUs.
Instance Name | On-Demand Hourly Rate | vCPU | Memory | Storage | Network Performance | GPUs |
---|---|---|---|---|---|---|
p5e.48xlarge | $108.152 | 192 | 2048 GiB | 8 x 3840 GB SSD | 3200 Gigabit | up to 8 NVIDIA Tesla H200 GPUs |
p5.48xlarge | $98.32 | 192 | 2048 GiB | 8 x 3840 GB SSD | 3200 Gigabit | up to 8 NVIDIA Tesla H100 GPUs |
p4d.24xlarge | $32.7726 | 96 | 1152 GiB | 8 x 1000 GB SSD | 400 Gigabit | up to 8 NVIDIA Tesla A100 GPUs |
p3.2xlarge | $3.06 | 8 | 61 GiB | EBS Only | Up to 10 Gigabit | up to 8 NVIDIA Tesla V100 GPUs |
p3.8xlarge | $12.24 | 32 | 244 GiB | EBS Only | 10 Gigabit | up to 8 NVIDIA Tesla V100 GPUs |
p3.16xlarge | $24.48 | 64 | 488 GiB | EBS Only | 25 Gigabit | up to 8 NVIDIA Tesla V100 GPUs |
g3.4xlarge | $1.14 | 16 | 122 GiB | EBS Only | Up to 10 Gigabit | up to 4 NVIDIA Tesla M60 GPUs |
g3.8xlarge | $2.28 | 32 | 244 GiB | EBS Only | 10 Gigabit | up to 4 NVIDIA Tesla M60 GPUs |
g3.16xlarge | $4.56 | 64 | 488 GiB | EBS Only | 20 Gigabit | up to 4 NVIDIA Tesla M60 GPUs |
g3s.xlarge | $0.75 | 4 | 30.5 GiB | EBS Only | 10 Gigabit | up to 4 NVIDIA Tesla M60 GPUs |
g4ad.xlarge | $0.37853 | 4 | 16 GiB | 150 GB NVMe SSD | Up to 10 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
g4ad.2xlarge | $0.54117 | 8 | 32 GiB | 300 GB NVMe SSD | Up to 10 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
g4ad.4xlarge | $0.867 | 16 | 64 GiB | 600 GB NVMe SSD | Up to 10 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
g4ad.8xlarge | $1.734 | 32 | 128 GiB | 1200 GB NVMe SSD | 15 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
g4ad.16xlarge | $3.468 | 64 | 256 GiB | 2400 GB NVMe SSD | 25 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
g4dn.xlarge | $0.526 | 4 | 16 GiB | 125 GB NVMe SSD | Up to 25 Gigabit | up to 4 NVIDIA T4 GPUs |
g4dn.2xlarge | $0.752 | 8 | 32 GiB | 225 GB NVMe SSD | Up to 25 Gigabit | up to 4 NVIDIA T4 GPUs |
g4dn.4xlarge | $1.204 | 16 | 64 GiB | 225 GB NVMe SSD | Up to 25 Gigabit | up to 4 NVIDIA T4 GPUs |
g5.xlarge | $1.006 | 4 | 16 GiB | 1 x 250 GB NVMe SSD | Up to 10 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.2xlarge | $1.212 | 8 | 32 GiB | 1 x 450 GB NVMe SSD | Up to 10 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.4xlarge | $1.624 | 16 | 64 GiB | 1 x 600 GB NVMe SSD | Up to 25 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.8xlarge | $2.448 | 32 | 128 GiB | 1 x 900 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.12xlarge | $5.672 | 48 | 192 GiB | 1 x 3800 GB NVMe SSD | 40 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.16xlarge | $4.096 | 64 | 256 GiB | 1 x 1900 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.24xlarge | $8.144 | 96 | 384 GiB | 1 x 3800 GB NVMe SSD | 50 Gigabit | up to 8 NVIDIA A10G GPUs |
g5.48xlarge | $16.288 | 192 | 768 GiB | 2 x 3800 GB NVMe SSD | 100 Gigabit | up to 8 NVIDIA A10G GPUs |
g6.xlarge | $0.8048 | 4 | 16 GiB | 1 x 250 GB NVMe SSD | Up to 10 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.2xlarge | $0.9776 | 8 | 32 GiB | 1 x 450 GB NVMe SSD | Up to 10 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.4xlarge | $1.3232 | 16 | 64 GiB | 1 x 600 GB NVMe SSD | Up to 25 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.8xlarge | $2.0144 | 32 | 128 GiB | 2 x 450 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.12xlarge | $4.6016 | 48 | 192 GiB | 4 x 940 GB NVMe SSD | 40 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.16xlarge | $3.3968 | 64 | 256 GiB | 2 x 940 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.24xlarge | $6.6752 | 96 | 384 GiB | 4 x 940 GB NVMe SSD | 50 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6.48xlarge | $13.3504 | 192 | 768 GiB | 8 x 940 GB NVMe SSD | 100 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
g6e.xlarge | $1.861 | 4 | 32 GiB | 1 x 250 GB NVMe SSD | Up to 20 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.2xlarge | $2.24208 | 8 | 64 GiB | 1 x 450 GB NVMe SSD | Up to 20 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.4xlarge | $3.00424 | 16 | 128 GiB | 1 x 600 GB NVMe SSD | 20 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.8xlarge | $4.52856 | 32 | 256 GiB | 1 x 900 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.12xlarge | $10.49264 | 48 | 384 GiB | 2 x 1900 GB NVMe SSD | 100 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.16xlarge | $7.57719 | 64 | 512 GiB | 1 x 1900 GB NVMe SSD | 35 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.24xlarge | $15.06559 | 96 | 768 GiB | 2 x 1900 GB NVMe SSD | 200 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
g6e.48xlarge | $30.13118 | 192 | 1536 GiB | 4 x 1900 GB NVMe SSD | 400 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
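As a rough sanity check on these hourly rates, here is a small sketch that converts a few of the on-demand prices above into approximate monthly costs for an instance left running 24/7 (rates copied from the table; 730 hours per month is an approximation):

# On-demand hourly rates (USD) taken from the table above.
HOURLY = {
    "g4dn.xlarge":  0.526,    # 1x T4, 16 GiB RAM
    "g5.xlarge":    1.006,    # 1x A10G
    "p4d.24xlarge": 32.7726,  # 8x A100
}

HOURS_PER_MONTH = 730  # roughly 24 * 365 / 12

for name, rate in HOURLY.items():
    print(f"{name}: ${rate * HOURS_PER_MONTH:,.2f} / month")
# g4dn.xlarge: $383.98 / month
# g5.xlarge: $734.38 / month
# p4d.24xlarge: $23,924.00 / month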