CANINE - Tokenization-Free Encoder

Total time needed: ~3 hours
Here, you will learn about CANINE, a way of encoding text without (explicit) tokenization.
Potential Use Cases
To replace tokenization-based input sequences in NLP-related tasks
Who Is This For?
INTERMEDIATE NLP users investigating options for tokenization in their models
ARTICLE 1. A 10,000-Feet overview of Tokenization
  • What are the steps involved in tokenization?
  • What does a tokenizer do?
  • What is padding and truncation?
  • How to deal with pre-tokenized inputs?
25 minutes
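As a rough illustration of the padding and truncation question above, the sketch below forces every sequence of token IDs to the same length. The function name `pad_or_truncate`, the pad ID of 0, and the sample IDs are all hypothetical choices made for this example; real tokenizer libraries expose their own options for this.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Cut or right-pad a list of token IDs to exactly max_len.

    Returns the fixed-length IDs plus an attention mask
    (1 = real token, 0 = padding). pad_id=0 is an assumption.
    """
    kept = ids[:max_len]                      # truncation: drop the tail
    n_pad = max_len - len(kept)               # how many pad slots remain
    padded = kept + [pad_id] * n_pad          # padding: fill on the right
    mask = [1] * len(kept) + [0] * n_pad
    return padded, mask


# Two made-up sequences of different lengths, forced to length 4.
batch = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
result = [pad_or_truncate(seq, max_len=4) for seq in batch]
```

The short sequence gains one pad slot and the long one loses its last two IDs, so both can be stacked into one rectangular batch.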
ARTICLE 2. Different Types of Tokenization Schemes
  • What is sub-word tokenization?
  • What are some of the common sub-word tokenization algorithms?
25 minutes
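To make the idea of sub-word tokenization concrete, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece inference. The vocabulary `toy_vocab` and the function name `wordpiece_like` are invented for illustration; a real vocabulary is learned from a corpus, and library implementations differ in detail.

```python
def wordpiece_like(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right.

    Pieces that continue a word carry a "##" prefix. If some position
    cannot be matched at all, the whole word maps to the unknown token.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:                    # try the longest span first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece          # mark word-internal pieces
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real model.
toy_vocab = {"token", "##ization", "##s", "un", "##related"}
pieces = wordpiece_like("tokenization", toy_vocab)
unknown = wordpiece_like("xyz", toy_vocab)
```

A word the vocabulary cannot cover collapses to a single unknown token, which is one of the failure modes that motivates tokenization-free approaches.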
ARTICLE 3. Tokenizers: How machines read
  • What is the need for tokenization?
  • What is the difference between (and the pros and cons of) character, word, and sub-word tokenizers?
  • How to use the SentencePiece and HuggingFace tokenizers?
40 minutes
WRITEUP 4. CANINE – A Tokenization-Free Encoder
  • What is the gist of the paper "CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation"?
10 minutes
ARTICLE 5. CANINE - Pre-training an Efficient Tokenization-Free Encoder for Language Representation
  • What are the drawbacks of existing tokenization-based models?
  • How does CANINE address these issues?
40 minutes
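CANINE operates directly on characters rather than on a learned vocabulary. The sketch below shows only the first step of that idea: turning text into a sequence of Unicode code points and back. The helper names are hypothetical, and the sketch does not cover the rest of the model (such as its embedding and downsampling stages).

```python
def encode_codepoints(text):
    """Map each character to its Unicode code point (no vocabulary needed)."""
    return [ord(ch) for ch in text]


def decode_codepoints(ids):
    """Invert encode_codepoints: code points back to a string."""
    return "".join(chr(i) for i in ids)


# "\u00e9" is the precomposed character e-with-acute.
ids = encode_codepoints("h\u00e9llo")
roundtrip = decode_codepoints(ids)
```

Because every character has a code point, this mapping never produces an out-of-vocabulary token. The trade-off is a longer input sequence than a sub-word tokenizer would produce.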
