Information on Chat Models

I wanted to create this page to compare the pricing and capabilities of different AI models, research how much it would cost to host an open-source model, and get a better sense of what large language models are out there.

Last Updated: 11/14/2024

Some Terminology

Tokens

Tokens are words, character sets, or combinations of words and punctuation that are used by large language models (LLMs) to break down text.
Tokens can be single characters like z or whole words like cat. Long words are broken up into several tokens. The set of all tokens used by the model is called the vocabulary, and the process of splitting text into tokens is called tokenization.

Tokens are used by the models below to calculate the cost of an API call, and each provider has its own method for counting them:

OpenAI

OpenAI Counting Tokens Reference
tiktoken is a fast open-source tokenizer by OpenAI. Given a text string and an encoding, a tokenizer can split the text string into a list of tokens. Knowing how many tokens are in a text string can tell you whether the string is too long for a text model to process and how much an OpenAI API call will cost. There are Python and JavaScript libraries that allow you to count OpenAI tokens (there are also libraries for other languages - see the link above).
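For example, a short Python sketch with tiktoken (the model name and sample text are arbitrary choices):
Python
import tiktoken

# Look up the encoding that a given model uses (gpt-4o as an example)
enc = tiktoken.encoding_for_model("gpt-4o")

text = "Tokens are pieces of words."
tokens = enc.encode(text)

print(tokens)       # the integer token IDs
print(len(tokens))  # the count used for pricing and context-length checks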

Google

Gemini Counting Tokens Reference
Counting Tokens Code Example Gemini
For Gemini models, a token is equivalent to about 4 characters. 100 tokens is equal to about 60-80 English words. When billing is enabled, the cost of a call to the Gemini API is determined in part by the number of input and output tokens, so knowing how to count tokens can be helpful.
You can count a prompt's tokens before sending it with the SDK's count_tokens method, or read the usage_metadata on a response after a call, as sketched below:
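A minimal sketch with the google-generativeai Python SDK, mirroring the official examples (YOUR_API_KEY is a placeholder):
Python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# count_tokens tokenizes the prompt without running (or billing) a generation
print(model.count_tokens("The quick brown fox jumps over the lazy dog."))

# After a real call, the response reports the actual token usage
response = model.generate_content("Explain how AI works")
print(response.usage_metadata)  # prompt_token_count, candidates_token_count, total_token_count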

Anthropic

Anthropic Token Counting
Token counting enables you to determine the number of tokens in a message before sending it to Claude, helping you make informed decisions about your prompts and usage. To access this feature, include the anthropic-beta: token-counting-2024-11-01 header in your API requests, or use client.beta.messages.count_tokens in your SDK calls.

How to count tokens with Anthropic:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
// Valid token counting models: Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Haiku, Claude 3 Opus
const response = await client.beta.messages.countTokens({
  betas: ["token-counting-2024-11-01"],
  model: 'claude-3-5-sonnet-20241022',
  system: 'You are a scientist',
  messages: [{
    role: 'user',
    content: 'Hello, Claude'
  }]
});

console.log(response); // { "input_tokens": 14 }

Llama API

I haven't yet seen a way to count tokens with the Llama API.

Abbreviations:

  • RPM: requests per minute
  • TPM: tokens per minute
  • RPD: requests per day

Enterprise API Models

Looking into the differences between the enterprise LLM providers. I want to see the cost breakdown for each and what their models' capabilities are.

OpenAI

Multiple models, each with different capabilities and price points. Prices can be viewed in units of either per 1M or 1K tokens. You can think of tokens as pieces of words, where 1,000 tokens is about 750 words.
Language models are also available in the Batch API, which returns completions within 24 hours for a 50% discount.
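As a quick worked example of the arithmetic, here is a rough cost calculation in Python using the gpt-4o-mini rates from the table below (illustrative only; check the pricing page for current rates):
Python
# gpt-4o-mini rates from the pricing table below
INPUT_PER_M = 0.150   # $ per 1M input tokens
OUTPUT_PER_M = 0.600  # $ per 1M output tokens

def call_cost(input_tokens: int, output_tokens: int, batched: bool = False) -> float:
    cost = input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M
    return cost / 2 if batched else cost  # the Batch API is a 50% discount

print(call_cost(10_000, 1_000))                # 0.0021
print(call_cost(10_000, 1_000, batched=True))  # 0.00105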

Learn more about OpenAI pricing on the OpenAI API pricing page.
Models:

  • GPT-4o
    • GPT-4o is our most advanced multimodal model that's faster and cheaper than GPT-4 Turbo with stronger vision capabilities. The model has 128k context and an October 2023 knowledge cutoff.
  • GPT-4o mini
    • GPT-4o mini is our most cost-efficient small model that's smarter and cheaper than GPT-3.5 Turbo, and has vision capabilities. The model has 128k context and an October 2023 knowledge cutoff.
  • OpenAI o1-preview
    • o1-preview is our new reasoning model for complex tasks. The model has 128k context and an October 2023 knowledge cutoff.
  • OpenAI o1-mini
    • o1-mini is a fast, cost-efficient reasoning model tailored to coding, math, and science use cases. The model has 128k context and an October 2023 knowledge cutoff.
OpenAI Model Pricing

| Model | Input (/ 1M input tokens) | Batched Input (/ 1M input tokens) | Cached Input (/ 1M cached input tokens) | Output (/ 1M output tokens) | Batched Output (/ 1M output tokens) |
| --- | --- | --- | --- | --- | --- |
| gpt-4o-mini | $0.150 | $0.075 | $0.075 | $0.600 | $0.300 |
| gpt-4o-mini-2024-07-18 | $0.150 | $0.075 | $0.075 | $0.600 | $0.300 |
| o1-preview | $15.00 | N/A | $7.50 | $60.00 | N/A |
| o1-preview-2024-09-12 | $15.00 | N/A | $7.50 | $60.00 | N/A |
| o1-mini | $3.00 | N/A | $1.50 | $12.00 | N/A |
| o1-mini-2024-09-12 | $3.00 | N/A | $1.50 | $12.00 | N/A |

Google

The Google Gemini API is a powerful tool that provides access to Google DeepMind's Gemini models. These models are designed to be multimodal, meaning they can understand and process various types of data, including text, images, code, and audio.

Find out more about Google Gemini Pricing on its pricing page. List of current Gemini models available through the API:

  • Gemini 1.5 Flash: Our fastest model, with great performance for diverse, repetitive tasks and a 1 million token context window.
    • Rate Limits:
      • 2,000 RPM
      • 4,000,000 TPM
    • Context Caching: $1.00 / 1,000,000 tokens per hour
    • Grounding with Google Search: $35 / 1,000 grounding requests (up to 5k grounding requests per day)
  • Gemini 1.5 Flash-8B: Our smallest model, for lower intelligence use cases, with a 1 million token context window.
    • Rate Limits:
      • 4,000 RPM
      • 4,000,000 TPM
    • Context Caching: $0.25 / 1,000,000 tokens per hour
    • Grounding with Google Search: $35 / 1,000 grounding requests (up to 5k grounding requests per day)
  • Gemini 1.5 Pro: Our next generation model with a breakthrough 2 million token context window. Now generally available for production use.
    • Rate Limits:
      • 1,000 RPM
      • 4,000,000 TPM
    • Context Caching: $4.50 / 1,000,000 tokens per hour
    • Grounding with Google Search: $35 / 1,000 grounding requests (up to 5k grounding requests per day)
  • Gemini 1.0 Pro: Our first generation model, offering only text and image reasoning. Generally available for production use.
    • Rate Limits:
      • 360 RPM
      • 120,000 TPM
      • 30,000 RPD
    • Context Caching: not available
    • Grounding with Google Search: $35 / 1,000 grounding requests (up to 5k grounding requests per day)
Google Model Pricing (all prices per 1 million tokens)

| Model | Input (prompts up to 128k tokens) | Output (up to 128k) | Context Caching (up to 128k) | Input (prompts longer than 128k tokens) | Output (longer than 128k) | Context Caching (longer than 128k) |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini 1.5 Flash | $0.075 | $0.30 | $0.01875 | $0.15 | $0.60 | $0.0375 |
| Gemini 1.5 Flash-8B | $0.0375 | $0.15 | $0.01 | $0.075 | $0.30 | $0.02 |
| Gemini 1.5 Pro | $1.25 | $5.00 | $0.3125 | $2.50 | $10.00 | $0.625 |
| Gemini 1.0 Pro | $0.50 | $1.50 | N/A | $0.50 | $1.50 | N/A |

Anthropic

Anthropic is an AI research and safety company founded in 2021 by former OpenAI researchers. Its mission is to create AI systems that are aligned with human values and are safe and reliable. The company focuses on developing large language models and other AI tools while emphasizing ethical guidelines and safety, aiming to prevent harmful consequences associated with advanced AI.

Models:

  • Claude 3.5 Sonnet
    • Our most intelligent model to date
    • 200K context window
    • 50% discount with the Batches API
  • Claude 3.5 Haiku
    • Fastest, most cost-effective model
    • 200K context window
    • 50% discount with the Batches API
  • Claude 3 Opus
    • Powerful model for complex tasks
    • 200K context window
    • 50% discount with the Batches API
Anthropic Model Pricing

| Model | Input (/ 1M tokens) | Prompt Caching Write (/ 1M tokens) | Prompt Caching Read (/ 1M tokens) | Output (/ 1M tokens) |
| --- | --- | --- | --- | --- |
| Claude 3.5 Sonnet | $3 | $3.75 | $0.30 | $15 |
| Claude 3.5 Haiku | $1 | $1.25 | $0.10 | $5 |
| Claude 3 Opus | $15 | $18.75 | $1.50 | $75 |

Hosted Llama APIs

The Llama family of models are open-source models released by Meta AI starting in February 2023. Since these models are open source, you can download them and run them on your own server. Given that GPUs can be expensive, it can be cheaper to run the model through an external API service. The prices below apply to the Llama API service.
| Title | Parameter Count | Price |
| --- | --- | --- |
| Small | 0B-8B | $0.0004 / 1K tokens |
| Medium | 8B-30B | $0.0016 / 1K tokens |
| Large | >30B | $0.0028 / 1K tokens |
Other Models Available by Llama API

| Name | Author | Pricing (per 1K tokens) |
| --- | --- | --- |
| llama3.2-11b-vision | meta | $0.0004 |
| llama3.2-1b | meta | $0.0004 |
| llama3.2-3b | meta | $0.0004 |
| llama3.2-90b-vision | meta | $0.0028 |
| llama3.1-405b | meta | $0.0036 |
| llama3.1-70b | meta | $0.0028 |
| llama3.1-8b | meta | $0.0004 |
| llama3-70b | meta | $0.0028 |
| llama3-8b | meta | $0.0004 |
| gemma2-27b | google | $0.0016 |
| gemma2-9b | google | $0.0004 |
| mixtral-8x22b | mistral | $0.0028 |
| mixtral-8x22b-instruct | mistral | $0.0028 |
| mixtral-8x7b-instruct | mistral | $0.0028 |
| mistral-7b | mistral | $0.0004 |
| mistral-7b-instruct | mistral | $0.0004 |
| llama-7b-32k | meta | $0.0028 |
| llama2-13b | meta | $0.0016 |
| llama2-70b | meta | $0.0028 |
| llama2-7b | meta | $0.0016 |
| Nous-Hermes-2-Mixtral-8x7B-DPO | mistral | $0.0004 |
| Nous-Hermes-2-Yi-34B | custom | $0.0028 |
| Qwen1.5-0.5B-Chat | custom | $0.0004 |
| Qwen1.5-1.8B-Chat | custom | $0.0004 |
| Qwen1.5-110B-Chat | custom | $0.0028 |
| Qwen1.5-14B-Chat | custom | $0.0016 |
| Qwen1.5-32B-Chat | custom | $0.0028 |
| Qwen1.5-4B-Chat | custom | $0.0004 |
| Qwen1.5-72B-Chat | custom | $0.0028 |
| Qwen1.5-7B-Chat | custom | $0.0004 |
| Qwen2-72B-Instruct | custom | $0.0028 |

How These APIs Work

In this section, I am going to take some notes on the documentation for the various API services and ideate on how to store messages in a database.
Documentation Links:

OpenAI

There are three types of roles when doing chat completion (AI chat) through OpenAI's chat completion API (and other enterprise models):

  1. system
    • Messages with the system role act as top-level instructions to the model, and typically describe what the model is supposed to do and how it should generally behave and respond.
    • Example:
      • You are a helpful assistant that answers programming questions in the style of an ancient Chinese emperor.
  2. user
    • User messages contain instructions that request a particular type of output from the model. You can think of user messages as the messages you might type in to ChatGPT as an end user.
  3. assistant
    • Messages with the assistant role are presumed to have been generated by the model, perhaps in a previous generation request. They can also be used to provide examples to the model for how it should respond to the current request - a technique known as few shot learning.
JavaScript
import OpenAI from "openai";
const openai = new OpenAI();
const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
        {"role": "user", "content": "write a haiku about ai"}
    ]
});
Python
from openai import OpenAI
client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "write a haiku about ai"}
    ]
)
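To show all three roles together, here is a sketch that combines a system prompt (reusing the example above) with a user/assistant pair acting as a few-shot example; the message contents are made up for illustration:
Python
from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # system: top-level instructions for how the model should behave
        {"role": "system", "content": "You are a helpful assistant that answers programming questions in the style of an ancient Chinese emperor."},
        # user + assistant: a few-shot example of the desired style
        {"role": "user", "content": "What is a variable?"},
        {"role": "assistant", "content": "A variable, honored subject, is a named vessel in which a value may rest."},
        # the actual request
        {"role": "user", "content": "What is a function?"},
    ],
)
print(completion.choices[0].message.content)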

Google

JavaScript
const { GoogleGenerativeAI } = require("@google/generative-ai");

const genAI = new GoogleGenerativeAI("YOUR_API_KEY");
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

const prompt = "Explain how AI works";

const result = await model.generateContent(prompt);
console.log(result.response.text());
Python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Explain how AI works")
print(response.text)

The APIs for Llama and Anthropic are both very similar to the OpenAI and Google APIs, and their docs can be found using the links above. As an example, a minimal Anthropic call is sketched below.
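A minimal Python sketch using the official anthropic SDK (max_tokens is required by the Messages API; 1024 here is an arbitrary choice):
Python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required cap on generated tokens
    messages=[
        {"role": "user", "content": "write a haiku about ai"}
    ],
)
print(message.content)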

How to Store Chats in a Database?
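Since every provider above exchanges messages as (role, content) pairs, one simple approach is a chats table plus a messages table that records each message's role, content, and position. A minimal sketch with sqlite3 (the schema and names are my own illustrative assumptions, not any provider's recommendation):
Python
import sqlite3

conn = sqlite3.connect("chats.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS chats (
    id INTEGER PRIMARY KEY,
    model TEXT NOT NULL,  -- e.g. 'gpt-4o' or 'claude-3-5-sonnet-20241022'
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY,
    chat_id INTEGER NOT NULL REFERENCES chats(id),
    position INTEGER NOT NULL,  -- order of the message within the chat
    role TEXT NOT NULL CHECK (role IN ('system', 'user', 'assistant')),
    content TEXT NOT NULL
);
""")

# Replaying a stored chat is a SELECT ordered by position; the rows map
# directly onto the messages arrays the APIs above expect.
rows = conn.execute(
    "SELECT role, content FROM messages WHERE chat_id = ? ORDER BY position",
    (1,),
).fetchall()
messages = [{"role": role, "content": content} for role, content in rows]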

Hosting Your Own Model

This section will go over some AWS prices. To run an LLM, you will need an EC2 instance with a GPU, so I am going to be looking into that. I will also be looking into the prices of additional RAM.

We recommend a GPU instance for most deep learning purposes. Training new models is faster on a GPU instance than a CPU instance. You can scale sub-linearly when you have multi-GPU instances or if you use distributed training across many instances with GPUs.
| Instance Name | On-Demand Hourly Rate | vCPU | Memory | Storage | Network Performance | GPUs |
| --- | --- | --- | --- | --- | --- | --- |
| p5e.48xlarge | $108.152 | 192 | 2048 GiB | 8 x 3840 GB SSD | 3200 Gigabit | up to 8 NVIDIA H200 GPUs |
| p5.48xlarge | $98.32 | 192 | 2048 GiB | 8 x 3840 GB SSD | 3200 Gigabit | up to 8 NVIDIA H100 GPUs |
| p4d.24xlarge | $32.7726 | 96 | 1152 GiB | 8 x 1000 GB SSD | 400 Gigabit | up to 8 NVIDIA A100 GPUs |
| p3.2xlarge | $3.06 | 8 | 61 GiB | EBS Only | Up to 10 Gigabit | up to 8 NVIDIA Tesla V100 GPUs |
| p3.8xlarge | $12.24 | 32 | 244 GiB | EBS Only | 10 Gigabit | up to 8 NVIDIA Tesla V100 GPUs |
| p3.16xlarge | $24.48 | 64 | 488 GiB | EBS Only | 25 Gigabit | up to 8 NVIDIA Tesla V100 GPUs |
| g3.4xlarge | $1.14 | 16 | 122 GiB | EBS Only | Up to 10 Gigabit | up to 4 NVIDIA Tesla M60 GPUs |
| g3.8xlarge | $2.28 | 32 | 244 GiB | EBS Only | 10 Gigabit | up to 4 NVIDIA Tesla M60 GPUs |
| g3.16xlarge | $4.56 | 64 | 488 GiB | EBS Only | 20 Gigabit | up to 4 NVIDIA Tesla M60 GPUs |
| g3s.xlarge | $0.75 | 4 | 30.5 GiB | EBS Only | 10 Gigabit | up to 4 NVIDIA Tesla M60 GPUs |
| g4ad.xlarge | $0.37853 | 4 | 16 GiB | 150 GB NVMe SSD | Up to 10 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
| g4ad.2xlarge | $0.54117 | 8 | 32 GiB | 300 GB NVMe SSD | Up to 10 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
| g4ad.4xlarge | $0.867 | 16 | 64 GiB | 600 GB NVMe SSD | Up to 10 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
| g4ad.8xlarge | $1.734 | 32 | 128 GiB | 1200 GB NVMe SSD | 15 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
| g4ad.16xlarge | $3.468 | 64 | 256 GiB | 2400 GB NVMe SSD | 25 Gigabit | up to 4 AMD Radeon Pro V520 GPUs |
| g4dn.xlarge | $0.526 | 4 | 16 GiB | 125 GB NVMe SSD | Up to 25 Gigabit | up to 4 NVIDIA T4 GPUs |
| g4dn.2xlarge | $0.752 | 8 | 32 GiB | 225 GB NVMe SSD | Up to 25 Gigabit | up to 4 NVIDIA T4 GPUs |
| g4dn.4xlarge | $1.204 | 16 | 64 GiB | 225 GB NVMe SSD | Up to 25 Gigabit | up to 4 NVIDIA T4 GPUs |
| g5.xlarge | $1.006 | 4 | 16 GiB | 1 x 250 GB NVMe SSD | Up to 10 Gigabit | up to 8 NVIDIA A10G GPUs |
| g5.2xlarge | $1.212 | 8 | 32 GiB | 1 x 450 GB NVMe SSD | Up to 10 Gigabit | up to 8 NVIDIA A10G GPUs |
| g5.4xlarge | $1.624 | 16 | 64 GiB | 1 x 600 GB NVMe SSD | Up to 25 Gigabit | up to 8 NVIDIA A10G GPUs |
| g5.8xlarge | $2.448 | 32 | 128 GiB | 1 x 900 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA A10G GPUs |
| g5.12xlarge | $5.672 | 48 | 192 GiB | 1 x 3800 GB NVMe SSD | 40 Gigabit | up to 8 NVIDIA A10G GPUs |
| g5.16xlarge | $4.096 | 64 | 256 GiB | 1 x 1900 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA A10G GPUs |
| g5.24xlarge | $8.144 | 96 | 384 GiB | 1 x 3800 GB NVMe SSD | 50 Gigabit | up to 8 NVIDIA A10G GPUs |
| g5.48xlarge | $16.288 | 192 | 768 GiB | 2 x 3800 GB NVMe SSD | 100 Gigabit | up to 8 NVIDIA A10G GPUs |
| g6.xlarge | $0.8048 | 4 | 16 GiB | 1 x 250 GB NVMe SSD | Up to 10 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
| g6.2xlarge | $0.9776 | 8 | 32 GiB | 1 x 450 GB NVMe SSD | Up to 10 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
| g6.4xlarge | $1.3232 | 16 | 64 GiB | 1 x 600 GB NVMe SSD | Up to 25 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
| g6.8xlarge | $2.0144 | 32 | 128 GiB | 2 x 450 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
| g6.12xlarge | $4.6016 | 48 | 192 GiB | 4 x 940 GB NVMe SSD | 40 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
| g6.16xlarge | $3.3968 | 64 | 256 GiB | 2 x 940 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
| g6.24xlarge | $6.6752 | 96 | 384 GiB | 4 x 940 GB NVMe SSD | 50 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
| g6.48xlarge | $13.3504 | 192 | 768 GiB | 8 x 940 GB NVMe SSD | 100 Gigabit | up to 8 NVIDIA L4 Tensor Core GPUs |
| g6e.xlarge | $1.861 | 4 | 32 GiB | 1 x 250 GB NVMe SSD | Up to 20 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
| g6e.2xlarge | $2.24208 | 8 | 64 GiB | 1 x 450 GB NVMe SSD | Up to 20 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
| g6e.4xlarge | $3.00424 | 16 | 128 GiB | 1 x 600 GB NVMe SSD | 20 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
| g6e.8xlarge | $4.52856 | 32 | 256 GiB | 1 x 900 GB NVMe SSD | 25 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
| g6e.12xlarge | $10.49264 | 48 | 384 GiB | 2 x 1900 GB NVMe SSD | 100 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
| g6e.16xlarge | $7.57719 | 64 | 512 GiB | 1 x 1900 GB NVMe SSD | 35 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
| g6e.24xlarge | $15.06559 | 96 | 768 GiB | 2 x 1900 GB NVMe SSD | 200 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
| g6e.48xlarge | $30.13118 | 192 | 1536 GiB | 4 x 1900 GB NVMe SSD | 400 Gigabit | up to 8 NVIDIA L40S Tensor Core GPUs |
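To translate these hourly rates into rough monthly figures, assuming an instance runs 24/7 (about 730 hours per month) and ignoring storage, data transfer, and any discounts:
Python
# Back-of-the-envelope monthly cost at the on-demand rates above
HOURS_PER_MONTH = 730  # average hours in a month

rates = {"g4dn.xlarge": 0.526, "g5.xlarge": 1.006, "p4d.24xlarge": 32.7726}
for name, hourly in rates.items():
    print(f"{name}: ${hourly * HOURS_PER_MONTH:,.2f}/month")
# g4dn.xlarge: $383.98/month
# g5.xlarge: $734.38/month
# p4d.24xlarge: $23,924.00/month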