Tokenization is an essential component of NLP systems. Although a number of tokenization schemes exist, they all have limitations. For example, it is difficult to deal with:

  • the large number of explicit rules for tokenization
  • informal texts with typos, spelling variations, transliteration, emojis
  • a single sub-word tokenization scheme shared across all languages
  • languages with complex structures
  • languages where words are not separated by space
  • learning infrequent words efficiently and effectively
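
As a small illustration of the space-separation issue, here is a minimal sketch with a deliberately naive whitespace tokenizer. The function name and both example sentences are assumptions made for this illustration and do not come from the paper.

```python
# A deliberately simplistic tokenizer: split on runs of whitespace.
def whitespace_tokenize(text: str) -> list:
    return text.split()

# Illustrative sentences (not from the paper); both mean "I like green tea".
english = "I like green tea"
japanese = "私は緑茶が好きです"  # written without spaces between words

english_tokens = whitespace_tokenize(english)
japanese_tokens = whitespace_tokenize(japanese)

print(len(english_tokens))   # one token per English word
print(len(japanese_tokens))  # the whole sentence comes back as a single token
```

Because the second sentence has no whitespace, the rule "split on spaces" gives no word boundaries at all, which is one reason such languages need dedicated segmentation rules or a different input representation.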

CANINE is a tokenization-free model: a large language encoder with a transformer stack at its core. Its input is a sequence of Unicode characters. This input is model agnostic and covers a wide range (>900) of languages.
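
To make the idea of "a sequence of Unicode characters" concrete, the sketch below turns a string into integer code points using Python's built-in `ord` and `chr`. The helper names and the sample string are assumptions for illustration only. This shows the general idea of a vocabulary-free input, not the authors' preprocessing code.

```python
def to_codepoints(text: str) -> list:
    """Map each character to its Unicode code point (an integer)."""
    return [ord(ch) for ch in text]

def from_codepoints(codepoints) -> str:
    """Inverse mapping: rebuild the string from code points."""
    return "".join(chr(cp) for cp in codepoints)

# Sample with an accented letter and an emoji, written with escapes
# so the exact code points are unambiguous.
sample = "h\u00e9llo \U0001F44D"
ids = to_codepoints(sample)

print(ids)
print(from_codepoints(ids) == sample)
```

No vocabulary file or tokenization rule is involved: any text that can be written in Unicode maps to integers the same way, and the mapping is lossless.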

The authors present three motives for CANINE.

  • to address the intricacies of language
  • to generalize without compromising performance
  • to reduce the effort needed for hand-crafting rules

In the model, the authors use hashing to reduce the number of embedding parameters. Within the model, the character sequence is then shortened (downsampled) using strided convolutions. CANINE is pre-trained on the NSP and MLM tasks (just like BERT). For evaluation, it uses the two primary tasks of TyDi QA:

  • Passage Selection Task
  • Minimal Answer Span Task
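
The two ideas above, hashed embeddings and strided downsampling, can be sketched in plain Python. Everything concrete here is an assumption chosen for illustration: the number of hash functions, the bucket count, the primes, the hashing formula, the embedding size, and the fixed averaging kernel. In the actual model the tables and convolution weights are learned, and the sizes are much larger.

```python
import random

# Hypothetical sizes for illustration; not the paper's values.
NUM_HASHES = 4
NUM_BUCKETS = 16
DIM = 8
SLICE = DIM // NUM_HASHES      # each hash function contributes one slice
PRIMES = [31, 43, 59, 61]      # assumed multipliers, one per hash function

rng = random.Random(0)
# One small lookup table per hash function: NUM_BUCKETS rows of SLICE floats.
tables = [
    [[rng.uniform(-1.0, 1.0) for _ in range(SLICE)] for _ in range(NUM_BUCKETS)]
    for _ in range(NUM_HASHES)
]

def embed_codepoint(cp: int) -> list:
    """Concatenate one slice per hash function to form the embedding."""
    vec = []
    for prime, table in zip(PRIMES, tables):
        bucket = ((cp + 1) * prime) % NUM_BUCKETS  # assumed hash formula
        vec.extend(table[bucket])
    return vec

def strided_conv1d(seq: list, kernel: list, stride: int) -> list:
    """1-D convolution over positions; scalar kernel weights are shared across channels."""
    width = len(kernel)
    channels = len(seq[0])
    out = []
    for start in range(0, len(seq) - width + 1, stride):
        window = seq[start:start + width]
        out.append([sum(w * vec[c] for w, vec in zip(kernel, window))
                    for c in range(channels)])
    return out

text = "tokenization"
char_embeddings = [embed_codepoint(ord(ch)) for ch in text]
# Fixed averaging kernel of width 4 with stride 4: four positions become one.
downsampled = strided_conv1d(char_embeddings, kernel=[0.25] * 4, stride=4)

hashed_params = NUM_HASHES * NUM_BUCKETS * SLICE
full_table_params = 0x110000 * DIM  # one row per possible Unicode code point

print(len(char_embeddings), len(downsampled))
print(hashed_params, full_table_params)
```

The point of the comparison at the end is that a direct lookup table needs one row for every possible code point, while the hashed version needs only a handful of small tables. Distinct characters can collide in one table but are unlikely to collide in all of them. The strided convolution then shortens the sequence so that later layers process fewer positions.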
Covers: theory of Tokenization-Free Model
Estimated time needed to finish: 10 minutes
Questions this item addresses:
  • What is the gist of the paper CANINE - Pre-training an Efficient Tokenization-Free Encoder for Language Representation?

CANINE - Tokenization-Free Encoder

Total time needed: ~3 hours
Here, you will learn about CANINE – a way of encoding text without using (explicit) tokenization.
Potential Use Cases
To replace existing tokenization-based input sequences in NLP-related tasks
Who Is This For?
INTERMEDIATE NLP users investigating options for tokenization in their models
ARTICLE 1. A 10,000-Foot Overview of Tokenization
  • What are the steps involved in tokenization?
  • What does a tokenizer do?
  • What is padding and truncation?
  • How to deal with pre-tokenized inputs?
25 minutes
ARTICLE 2. Different Types of Tokenization Schemes
  • What is sub-word tokenization?
  • What are some of the common sub-word tokenization algorithms?
25 minutes
ARTICLE 3. Tokenizers: How machines read
  • What is the need for tokenization?
  • What is the difference between (and pros-cons of) character, word, and sub-word tokenizers?
  • How to use the SentencePiece and HuggingFace tokenizers?
40 minutes
WRITEUP 4. CANINE – A Tokenization-Free Encoder
  • What is the gist of the paper CANINE - Pre-training an Efficient Tokenization-Free Encoder for Language Representation?
10 minutes
ARTICLE 5. CANINE - Pre-training an Efficient Tokenization-Free Encoder for Language Representation
  • What are the drawbacks of the different (existing) tokenization-based models?
  • How does CANINE address these issues?
40 minutes
