Here's how OpenAI Token count is computed in Tiktokenizer - Part 1
In this article, we will review how OpenAI token count is computed in tiktokenizer. We will look at:

- TokenViewer component
- tokenizer
- createTokenizer function
TokenViewer component
In tiktokenizer/src/pages/index.tsx, at line 86, you will find the following code:
<section className="flex flex-col gap-4">
  <TokenViewer model={model} data={tokens} isFetching={false} />
</section>
And TokenViewer is imported as shown below:
import { TokenViewer } from "~/sections/TokenViewer";
This component has three props:

- model
- data
- isFetching

So what does this TokenViewer component look like?
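Based purely on the usage above, we can sketch the shape of these props. This is a hypothetical reconstruction; the real prop types live in ~/sections/TokenViewer and may differ:

```typescript
// Hypothetical prop types inferred from how TokenViewer is called in
// index.tsx; the actual component's definitions may differ.
type TokenViewerProps = {
  model: string;                // the selected model or encoding name
  data: number[] | undefined;   // token ids produced by tokenize(), if ready
  isFetching: boolean;          // whether the tokenizer is still loading
};

// A minimal stand-in that mirrors what such a component might render:
function describeTokens(props: TokenViewerProps): string {
  if (props.isFetching) return "loading...";
  const count = props.data?.length ?? 0;
  return `${props.model}: ${count} tokens`;
}

console.log(describeTokens({ model: "gpt-4", data: [1, 2, 3], isFetching: false }));
```

Note that `data` must tolerate `undefined`, because `tokens` is derived via optional chaining (`tokenizer.data?.tokenize(...)`) and is absent until the tokenizer has loaded.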
tokenizer
At line 43 in tiktokenizer/src/pages/index.tsx, you will find the following code:
const tokenizer = useQuery({
  queryKey: [model],
  queryFn: ({ queryKey: [model] }) => createTokenizer(model!),
});

const tokens = tokenizer.data?.tokenize(inputText);
useQuery is imported as shown below:
import { useQuery } from "@tanstack/react-query";
The createTokenizer function is imported as shown below:
import { createTokenizer } from "~/models/tokenizer";
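What useQuery buys us here is caching keyed on the model name: the async createTokenizer factory runs once per model, and subsequent renders reuse the result. Stripped of React, that behavior can be sketched as follows (fakeCreateTokenizer and getTokenizer are hypothetical stand-ins, not part of tiktokenizer):

```typescript
// Simplified stand-in for the Tokenizer interface from ~/models/tokenizer.
interface Tokenizer {
  tokenize(text: string): number[];
}

// Hypothetical cache illustrating what useQuery({ queryKey: [model] })
// provides: one tokenizer per model name, created lazily and reused.
const cache = new Map<string, Promise<Tokenizer>>();

async function fakeCreateTokenizer(model: string): Promise<Tokenizer> {
  // Stand-in for createTokenizer: maps whitespace-separated words to lengths.
  return {
    tokenize: (text) => text.split(/\s+/).filter(Boolean).map((w) => w.length),
  };
}

function getTokenizer(model: string): Promise<Tokenizer> {
  let entry = cache.get(model);
  if (!entry) {
    entry = fakeCreateTokenizer(model);
    cache.set(model, entry);
  }
  return entry; // same promise on every subsequent call for this model
}
```

The `tokenizer.data?.tokenize(inputText)` line then corresponds to awaiting this promise and calling tokenize on the result once it resolves.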
createTokenizer function
In tiktokenizer/src/models/tokenizer.ts, you will find the following code at line 122:
export async function createTokenizer(name: string): Promise<Tokenizer> {
  console.log("createTokenizer", name);
  const oaiEncoding = oaiEncodings.safeParse(name);
  if (oaiEncoding.success) {
    console.log("oaiEncoding", oaiEncoding.data);
    return new TiktokenTokenizer(oaiEncoding.data);
  }

  const oaiModel = oaiModels.safeParse(name);
  if (oaiModel.success) {
    console.log("oaiModel", oaiModel.data);
    return new TiktokenTokenizer(oaiModel.data);
  }

  const ossModel = openSourceModels.safeParse(name);
  if (ossModel.success) {
    console.log("loading tokenizer", ossModel.data);
    const tokenizer = await OpenSourceTokenizer.load(ossModel.data);
    console.log("loaded tokenizer", name);
    return new OpenSourceTokenizer(tokenizer, name);
  }

  throw new Error("Invalid model or encoding");
}
oaiEncodings, oaiModels, and openSourceModels are imported as shown below:
import { oaiEncodings, oaiModels, openSourceModels } from ".";
So this function returns either:

- a TiktokenTokenizer, or
- an OpenSourceTokenizer
You will find more information about oaiEncodings, oaiModels, and openSourceModels in tiktokenizer/src/models (they are imported from that folder's index module via `from "."`).
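Since safeParse is zod's validation API, oaiEncodings, oaiModels, and openSourceModels are presumably zod string schemas. The first-match-wins cascade in createTokenizer can be sketched with a minimal stand-in (the schema names and values below are hypothetical, not the real lists):

```typescript
// Minimal stand-in for zod's safeParse on a string enum, to illustrate
// the first-match-wins cascade in createTokenizer.
function makeEnum(values: string[]) {
  return {
    safeParse(name: string) {
      return values.includes(name)
        ? { success: true as const, data: name }
        : { success: false as const, data: undefined };
    },
  };
}

// Hypothetical example values; the real lists live in src/models.
const encodings = makeEnum(["cl100k_base", "o200k_base"]);
const models = makeEnum(["gpt-4", "gpt-3.5-turbo"]);

// Mirrors the cascade: try encodings first, then models, else throw.
function classify(name: string): string {
  const enc = encodings.safeParse(name);
  if (enc.success) return `encoding:${enc.data}`;
  const model = models.safeParse(name);
  if (model.success) return `model:${model.data}`;
  throw new Error("Invalid model or encoding");
}
```

The order matters: a name is checked against the encoding list before the model list, and only names that match neither (nor the open-source list, in the real function) reach the final throw.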
You will learn more about TiktokenTokenizer and OpenSourceTokenizer in the next article.
About me:
Hey, my name is Ramu Narasinga. I study codebase architecture in large open-source projects.
Email: ramu.narasinga@gmail.com
Want to learn from open-source code? Solve challenges inspired by open-source projects.
References:
- https://github.com/dqbd/tiktokenizer/blob/master/src/pages/index.tsx#L86
- https://github.com/dqbd/tiktokenizer/blob/master/src/pages/index.tsx#L48
- https://github.com/dqbd/tiktokenizer/blob/master/src/pages/index.tsx#L43
- https://github.com/dqbd/tiktokenizer/blob/master/src/models/tokenizer.ts#L122