Here's how OpenAI Token count is computed in Tiktokenizer - Part 1

In this article, we will review how OpenAI token count is computed in tiktokenizer. We will look at:

  1. TokenViewer component

  2. tokenizer

  3. createTokenizer function

TokenViewer component

In tiktokenizer/src/pages/index.tsx, at line 86, you will find the following code:

 <section className="flex flex-col gap-4">
    <TokenViewer model={model} data={tokens} isFetching={false} />
 </section>

TokenViewer is imported as shown below:

import { TokenViewer } from "~/sections/TokenViewer";

This component has three props:

  1. model

  2. data

  3. isFetching

So what does this TokenViewer component look like?
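Before digging into the real component, it helps to pin down its contract. The following is a hypothetical sketch, not the actual code from ~/sections/TokenViewer: the shape of `data` and the `describeTokens` helper are assumptions made for illustration, standing in for the render logic so the props are concrete.

```typescript
// Hypothetical sketch of the TokenViewer contract (not the actual component,
// which lives in ~/sections/TokenViewer). The shape of `data` is assumed.

interface TokenViewerProps {
  model: string; // selected model or encoding name
  data?: { text: string[]; ids: number[] }; // tokenized input (shape assumed)
  isFetching: boolean; // true while a tokenizer is still loading
}

// A plain function standing in for the render logic: it reports the
// token count, which is the headline number Tiktokenizer displays.
function describeTokens(props: TokenViewerProps): string {
  if (props.isFetching) return "Loading...";
  const count = props.data?.ids.length ?? 0;
  return `${props.model}: ${count} tokens`;
}

console.log(
  describeTokens({
    model: "gpt-4o",
    data: { text: ["Hello", " world"], ids: [13225, 2375] },
    isFetching: false,
  })
); // gpt-4o: 2 tokens
```

The page passes `isFetching={false}` unconditionally, so in practice the component's loading state is driven by whether `data` (the `tokens` variable) is defined.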

tokenizer

At line 43 in tiktokenizer/src/pages/index.tsx, you will find the following code:

const tokenizer = useQuery({
  queryKey: [model],
  queryFn: ({ queryKey: [model] }) => createTokenizer(model!),
});

const tokens = tokenizer.data?.tokenize(inputText);
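The key idea in this snippet is that `queryKey: [model]` makes react-query cache one tokenizer per model name, so switching back to a previously selected model reuses the cached instance instead of rebuilding it. The sketch below reimplements that caching behavior in plain TypeScript to make the pattern visible; `FakeTokenizer`, `getTokenizer`, and the word-splitting tokenization are illustrative stand-ins, not code from the repo.

```typescript
// Illustrative sketch of the caching behavior behind
// useQuery({ queryKey: [model], queryFn: ... }): one tokenizer per key.

class FakeTokenizer {
  constructor(public name: string) {}
  tokenize(text: string): number[] {
    // trivial stand-in: one "token" per whitespace-separated word
    return text.split(/\s+/).filter(Boolean).map((_, i) => i);
  }
}

const cache = new Map<string, FakeTokenizer>();
let creations = 0;

function getTokenizer(model: string): FakeTokenizer {
  let t = cache.get(model);
  if (!t) {
    creations++; // only runs on a cache miss, like queryFn
    t = new FakeTokenizer(model);
    cache.set(model, t);
  }
  return t;
}

const tokens = getTokenizer("gpt-4o").tokenize("Hello brave new world");
console.log(tokens.length); // 4
getTokenizer("gpt-4o"); // cache hit: no second creation
console.log(creations); // 1
```

Note also the optional chaining in `tokenizer.data?.tokenize(inputText)`: while the tokenizer is still loading, `tokenizer.data` is undefined and `tokens` is simply undefined rather than throwing.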

useQuery is imported as shown below:

import { useQuery } from "@tanstack/react-query";

The createTokenizer function is imported as shown below:

import { createTokenizer } from "~/models/tokenizer";

createTokenizer function

In tiktokenizer/src/models/tokenizer.ts, you will find the following code at line 122:

export async function createTokenizer(name: string): Promise<Tokenizer> {
  console.log("createTokenizer", name);
  const oaiEncoding = oaiEncodings.safeParse(name);
  if (oaiEncoding.success) {
    console.log("oaiEncoding", oaiEncoding.data);
    return new TiktokenTokenizer(oaiEncoding.data);
  }
  const oaiModel = oaiModels.safeParse(name);
  if (oaiModel.success) {
    console.log("oaiModel", oaiModel.data);
    return new TiktokenTokenizer(oaiModel.data);
  }

  const ossModel = openSourceModels.safeParse(name);
  if (ossModel.success) {
    console.log("loading tokenizer", ossModel.data);
    const tokenizer = await OpenSourceTokenizer.load(ossModel.data);
    console.log("loaded tokenizer", name);
    return new OpenSourceTokenizer(tokenizer, name);
  }
  throw new Error("Invalid model or encoding");
}

oaiEncodings, oaiModels, and openSourceModels are imported as shown below:

import { oaiEncodings, oaiModels, openSourceModels } from ".";
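Notice the dispatch pattern in createTokenizer: each name set is a zod schema, and `safeParse` returns `{ success: true, data }` or `{ success: false }` instead of throwing, so the function can try each schema in turn and fall through to the next. The sketch below reproduces that control flow with a tiny hand-rolled enum validator in place of zod; the value sets (`"cl100k_base"`, `"gpt-4"`, etc.) are illustrative samples, not the full lists from the repo.

```typescript
// Sketch of the safeParse-based dispatch in createTokenizer, with a
// hand-rolled enum validator standing in for the zod schemas.

type ParseResult<T> = { success: true; data: T } | { success: false };

function makeEnum<T extends string>(...values: T[]) {
  return {
    safeParse(input: string): ParseResult<T> {
      return (values as string[]).includes(input)
        ? { success: true, data: input as T }
        : { success: false };
    },
  };
}

// Illustrative value sets; the real lists live in src/models.
const oaiEncodings = makeEnum("cl100k_base", "o200k_base");
const oaiModels = makeEnum("gpt-4", "gpt-3.5-turbo");
const openSourceModels = makeEnum("meta-llama/Llama-2-7b");

// Mirrors createTokenizer's control flow: try each schema, fall through.
function classify(name: string): string {
  const enc = oaiEncodings.safeParse(name);
  if (enc.success) return `encoding:${enc.data}`;
  const model = oaiModels.safeParse(name);
  if (model.success) return `oai-model:${model.data}`;
  const oss = openSourceModels.safeParse(name);
  if (oss.success) return `oss-model:${oss.data}`;
  throw new Error("Invalid model or encoding");
}

console.log(classify("cl100k_base")); // encoding:cl100k_base
console.log(classify("gpt-4")); // oai-model:gpt-4
```

This ordering also means an encoding name wins over a model name if a string ever matched both, and only a string matching none of the three schemas reaches the final `throw`.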

So this function returns either:

  1. a TiktokenTokenizer

  2. an OpenSourceTokenizer

You will find more information about oaiEncodings, oaiModels, and openSourceModels in tiktokenizer/src/models (the "." in the import above resolves to that directory's index file).

You will learn more about TiktokenTokenizer and OpenSourceTokenizer in the next article.

About me:

Hey, my name is Ramu Narasinga. I study codebase architecture in large open-source projects.

Email: ramu.narasinga@gmail.com

Want to learn from open-source code? Solve challenges inspired by open-source projects.

References:

  1. https://github.com/dqbd/tiktokenizer/blob/master/src/pages/index.tsx#L86

  2. https://github.com/dqbd/tiktokenizer/blob/master/src/pages/index.tsx#L48

  3. https://github.com/dqbd/tiktokenizer/blob/master/src/pages/index.tsx#L43

  4. https://github.com/dqbd/tiktokenizer/blob/master/src/models/tokenizer.ts#L122

  5. https://tiktokenizer.vercel.app/