If you need a programmatic interface for tokenizing text, check out the tiktoken package for Python. For JavaScript, the community-supported @dqbd/tiktoken package works with most GPT models; it is the package OpenAI's own tokenizer documentation points to, though it is maintained by the community rather than by OpenAI.
tiktoken is a BPE tokeniser for use with OpenAI’s models, forked from the original tiktoken library to provide JS/WASM bindings for NodeJS and other JS runtimes.
This repository contains the following packages:
tiktoken (formerly hosted at @dqbd/tiktoken): WASM bindings for the original Python library, providing full 1-to-1 feature parity.
js-tiktoken: Pure JavaScript port of the original library with the core functionality, suitable for environments where WASM is not well supported or not desired (such as edge runtimes); a short sketch of it follows the basic usage example below.
Documentation for js-tiktoken can be found here. Documentation for tiktoken can be found here.
Basic usage of the WASM tiktoken package follows; it includes all the OpenAI encoders and ranks:
```typescript
import assert from "node:assert";
import { get_encoding, encoding_for_model } from "tiktoken";

const enc = get_encoding("gpt2");
assert(
  new TextDecoder().decode(enc.decode(enc.encode("hello world"))) ===
    "hello world"
);

// To get the tokeniser corresponding to a specific model in the OpenAI API:
const modelEnc = encoding_for_model("text-davinci-003");

// Extend an existing encoding with custom special tokens
const customEnc = encoding_for_model("gpt2", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
});

// Don't forget to free each encoder once it is no longer needed
enc.free();
modelEnc.free();
customEnc.free();
```
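If WASM is not an option, the pure-JavaScript js-tiktoken package mentioned in the list above covers the same core workflow. Here is a minimal sketch, assuming the getEncoding entry point and the "cl100k_base" encoding name from the js-tiktoken documentation; treat the exact API surface as an assumption and verify it against the docs:

```typescript
import { getEncoding } from "js-tiktoken";

// Load an encoding by name (assumed API; see the js-tiktoken docs)
const enc = getEncoding("cl100k_base");

const tokens = enc.encode("hello world");
console.log(tokens.length); // number of tokens in the input

// The pure-JS port decodes straight back to a string, and there is
// no .free() call: its memory is managed by the JS garbage collector.
console.log(enc.decode(tokens)); // "hello world"
```

The trade-off between the two packages follows from the descriptions above: the WASM build mirrors the Python library 1-to-1, while js-tiktoken gives up some of that parity in exchange for running in environments where WASM is not well supported, such as edge runtimes.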
Now that we understand what tiktoken is, let's look at some usage examples in the Tiktokenizer app.