Use the tiktoken package to tokenize text for OpenAI LLMs.

In this article, you will learn how to use the tiktoken package to tokenize text for OpenAI LLMs. We will look at:

  1. What is tiktoken?

  2. tiktoken usage examples in the Tiktokenizer app.

What is tiktoken?

If you need a programmatic interface for tokenizing text, check out the tiktoken package for Python. For JavaScript, the community-supported @dqbd/tiktoken package works with most GPT models.

The @dqbd/tiktoken package is recommended by OpenAI and is community supported.

tiktoken is a BPE tokeniser for use with OpenAI’s models, forked from the original tiktoken library to provide JS/WASM bindings for NodeJS and other JS runtimes.
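To see what BPE tokenization looks like in practice, here is a minimal sketch using the WASM package (the sample string and encoding name are just illustrative):

import { get_encoding } from "tiktoken";

// BPE splits text into subword pieces, each mapped to an integer ID
const enc = get_encoding("cl100k_base");

const tokens = enc.encode("tiktoken is great!");
console.log(tokens.length); // number of tokens, useful for cost estimation
console.log(tokens); // Uint32Array of token IDs

// decode() returns UTF-8 bytes, so round-tripping needs a TextDecoder
console.log(new TextDecoder().decode(enc.decode(tokens)));

enc.free();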

This repository contains the following packages:

  • tiktoken (formerly hosted at @dqbd/tiktoken): WASM bindings for the original Python library, providing full 1-to-1 feature parity.

  • js-tiktoken: Pure JavaScript port of the original library with the core functionality, suitable for environments where WASM is not well supported or not desired (such as edge runtimes).

Documentation for js-tiktoken can be found here. Documentation for tiktoken can be found here.
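As a quick comparison, here is a minimal sketch of the js-tiktoken API based on its README; note the camelCase names and that no free() call is needed:

import { getEncoding } from "js-tiktoken";

// Pure JS port: no WASM to load and no manual memory management
const enc = getEncoding("cl100k_base");

const tokens = enc.encode("hello world"); // plain number[]
console.log(enc.decode(tokens)); // decode returns a string directly here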

Install

The WASM version of tiktoken can be installed from NPM:

npm install tiktoken

Usage

Basic usage is shown below. The package includes all the OpenAI encoders and ranks:

import assert from "node:assert";
import { get_encoding, encoding_for_model } from "tiktoken";

const enc = get_encoding("gpt2");
assert(
  new TextDecoder().decode(enc.decode(enc.encode("hello world"))) ===
    "hello world"
);

// To get the tokeniser corresponding to a specific model in the OpenAI API:
const modelEnc = encoding_for_model("text-davinci-003");

// Extend an existing encoding with custom special tokens
const extendedEnc = encoding_for_model("gpt2", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
});

// Don't forget to free each encoder once it is no longer needed
enc.free();
modelEnc.free();
extendedEnc.free();
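A common use for this API is counting the tokens in a prompt before sending it to the OpenAI API. Here is a minimal sketch (the prompt text is just an illustration):

import { encoding_for_model } from "tiktoken";

const enc = encoding_for_model("gpt-3.5-turbo");

const prompt = "Summarize the following article in three bullet points: ...";
const tokenCount = enc.encode(prompt).length;
console.log(`Prompt uses ${tokenCount} tokens`);

enc.free();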

Now that we understand what tiktoken is, let's look at some usage examples in the Tiktokenizer app.

tiktoken usage examples in the Tiktokenizer app

In tiktokenizer/src/models/tokenizer.ts, at line 2, you will find the following code:

import { get_encoding, encoding_for_model, type Tiktoken } from "tiktoken";

get_encoding

You will find the following code in tiktokenizer/src/models/tokenizer.ts, starting at line 44.

const enc =
  model === "gpt-3.5-turbo" || model === "gpt-4" || model === "gpt-4-32k"
    ? get_encoding("cl100k_base", {
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
        "<|im_sep|>": 100266,
      })
    : model === "gpt-4o"
    ? get_encoding("o200k_base", {
        "<|im_start|>": 200264,
        "<|im_end|>": 200265,
        "<|im_sep|>": 200266,
      })
    : encoding_for_model(model);
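The special tokens registered above are rejected by encode() by default. Assuming the WASM bindings mirror the Python API's allowed_special parameter (an assumption, not code from the app), a chat-style string could be tokenized like this:

import { get_encoding } from "tiktoken";

const enc = get_encoding("cl100k_base", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
  "<|im_sep|>": 100266,
});

// Pass "all" as allowed_special so the chat markers encode to their IDs
// instead of throwing an error
const tokens = enc.encode("<|im_start|>user<|im_sep|>Hello!<|im_end|>", "all");
console.log(tokens);

enc.free();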

encoding_for_model

Below is the fallback branch of the ternary above, invoking the encoding_for_model function for every other model:

encoding_for_model(model);

You will find the above code in tiktokenizer/src/models/tokenizer.ts at line 56.
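To get a feel for what Tiktokenizer renders, here is a sketch (not the app's actual rendering code) that decodes each token ID back to the text segment it covers:

import { encoding_for_model } from "tiktoken";

const enc = encoding_for_model("text-davinci-003");
const decoder = new TextDecoder();

const tokens = enc.encode("Tokenization is fun!");
// decode each ID individually to see the text piece it maps to
const segments = Array.from(tokens, (id) =>
  decoder.decode(enc.decode(new Uint32Array([id])))
);
console.log(segments); // e.g. [ "Token", "ization", " is", " fun", "!" ]

enc.free();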

About me:

Hey, my name is Ramu Narasinga. I study codebase architecture in large open-source projects.

Email: ramu.narasinga@gmail.com

Want to learn from open-source code? Solve challenges inspired by open-source projects.

References:

  1. https://github.com/dqbd/tiktokenizer/blob/master/src/models/tokenizer.ts#L2C1-L2C76

  2. https://www.npmjs.com/package/tiktoken

  3. https://platform.openai.com/tokenizer