AI-Accelerated Product Development
CANINE – A Tokenization-Free Encoder
Here, you will learn about CANINE – a way of encoding text without (explicit) tokenization.
Potential Use Cases
To replace existing tokenization-based input pipelines in NLP tasks
Who Is This For?
NLP practitioners investigating tokenization options for their models
Click on each of the following topics to see details.
1. A 10,000-Foot Overview of Tokenization
What are the steps involved in tokenization?
What does a tokenizer do?
What is padding and truncation?
How to deal with pre-tokenized inputs?
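
These questions are easiest to answer hands-on. Below is a minimal, hedged sketch that assumes the Hugging Face transformers package is installed; the bert-base-uncased checkpoint is an illustrative choice, not a requirement. It shows basic tokenization, padding and truncation, and pre-tokenized inputs.

```python
# A minimal sketch, assuming the Hugging Face `transformers` package is installed.
# The checkpoint "bert-base-uncased" is an illustrative choice, not a requirement.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# What a tokenizer does: split text into tokens, then map tokens to integer IDs.
tokens = tokenizer.tokenize("Tokenization splits text into smaller units.")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)

# Padding and truncation: force every sequence in a batch to the same length.
batch = tokenizer(
    ["A short sentence.", "A much longer sentence that will need to be cut down."],
    padding="max_length",  # pad shorter sequences up to max_length
    truncation=True,       # truncate longer sequences down to max_length
    max_length=12,
)
print(batch["input_ids"])

# Pre-tokenized input: the words are already split, so the tokenizer only
# handles sub-word splitting and ID conversion.
pretok = tokenizer(["Hello", "pre-tokenized", "world"], is_split_into_words=True)
print(pretok["input_ids"])
```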
2. Different Types of Tokenization Schemes
What is sub-word tokenization?
What are some of the common sub-word tokenization algorithms?
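
As a hedged illustration of how the common algorithms differ, the sketch below (again assuming transformers is installed) tokenizes the same word with three public checkpoints, each trained with a different sub-word algorithm; the checkpoint choices are assumptions made only for illustration.

```python
# A minimal sketch, assuming `transformers` is installed. Each checkpoint below was
# trained with a different sub-word algorithm: byte-level BPE (gpt2),
# WordPiece (bert-base-uncased), and SentencePiece Unigram (albert-base-v2).
from transformers import AutoTokenizer

word = "unbelievably"
for name in ["gpt2", "bert-base-uncased", "albert-base-v2"]:
    tok = AutoTokenizer.from_pretrained(name)
    # The same word is split differently depending on the algorithm and vocabulary.
    print(f"{name:20s} {tok.tokenize(word)}")
```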
3. Tokenizers: How machines read
What is the need for tokenization?
What are the differences between (and the pros and cons of) character, word, and sub-word tokenizers?
How to use the SentencePiece and HuggingFace tokenizers?
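
A hedged end-to-end sketch of both libraries follows; it assumes the sentencepiece and transformers packages are installed, and corpus.txt is a hypothetical plain-text file used only to show the training call.

```python
# A minimal sketch, assuming the `sentencepiece` and `transformers` packages are
# installed. "corpus.txt" is a hypothetical plain-text training file.
import sentencepiece as spm
from transformers import AutoTokenizer

# Train a small SentencePiece model directly on raw text (no pre-tokenization needed).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spm_demo", vocab_size=8000
)
sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("Tokenizers are how machines read text.", out_type=str))

# Load a ready-made sub-word tokenizer from the Hugging Face Hub and encode text.
hf_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = hf_tok("Tokenizers are how machines read text.")
print(enc["input_ids"])
print(hf_tok.convert_ids_to_tokens(enc["input_ids"]))
```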
4. CANINE – A Tokenization-Free Encoder
What is the gist of the paper CANINE - Pre-training an Efficient Tokenization-Free Encoder for Language Representation?
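
To make the gist concrete, here is a hedged sketch that assumes transformers (with PyTorch) is installed and uses the public google/canine-s checkpoint as an example; the point is only that CANINE's input IDs are raw Unicode code points rather than sub-word vocabulary indices.

```python
# A minimal sketch, assuming `transformers` and PyTorch are installed and the public
# checkpoint "google/canine-s" is used purely as an example.
import torch
from transformers import CanineTokenizer, CanineModel

tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
model = CanineModel.from_pretrained("google/canine-s")

text = "CANINE reads raw characters."
inputs = tokenizer(text, return_tensors="pt")
# Each input ID is (apart from special positions) the character's Unicode code point.
print(inputs["input_ids"][0].tolist())

with torch.no_grad():
    outputs = model(**inputs)
# One contextual representation per character position.
print(outputs.last_hidden_state.shape)
```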
5. CANINE - Pre-training an Efficient Tokenization-Free Encoder for Language Representation
What are the drawbacks of the different existing tokenization-based models?
How does CANINE address these issues?
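
The sketch below is a conceptual illustration, not the authors' implementation: it mimics two of CANINE's key ideas with assumed constants (number of hash functions, bucket count, embedding size) and replaces the paper's strided convolution with simple mean pooling, just to show how raw code points can be embedded without a fixed vocabulary and how the character sequence can be shortened before the deep encoder.

```python
# A conceptual sketch of two CANINE ideas, NOT the paper's implementation:
# (1) hash-based embeddings of Unicode code points, avoiding a huge vocabulary, and
# (2) downsampling the character sequence before the deep transformer stack.
# All constants below (hash count, bucket count, dimensions) are illustrative
# assumptions, and mean pooling stands in for the paper's strided convolution.
import numpy as np

NUM_HASHES, NUM_BUCKETS, DIM = 8, 16384, 64
rng = np.random.default_rng(0)
# One small embedding table per hash function; their slices are concatenated per character.
tables = [rng.normal(size=(NUM_BUCKETS, DIM // NUM_HASHES)) for _ in range(NUM_HASHES)]

def embed_chars(text: str) -> np.ndarray:
    """Map each character's Unicode code point to a vector via several hash lookups."""
    vectors = []
    for ch in text:
        code_point = ord(ch)  # no tokenizer, no out-of-vocabulary characters
        slices = [
            tables[k][(code_point * (2 * k + 1) + k) % NUM_BUCKETS]
            for k in range(NUM_HASHES)
        ]
        vectors.append(np.concatenate(slices))
    return np.stack(vectors)

def downsample(char_vectors: np.ndarray, rate: int = 4) -> np.ndarray:
    """Shorten the character sequence so the deep encoder sees far fewer positions."""
    n = (len(char_vectors) // rate) * rate
    return char_vectors[:n].reshape(-1, rate, char_vectors.shape[-1]).mean(axis=1)

chars = embed_chars("No fixed vocabulary is needed for any language.")
print(chars.shape, downsample(chars).shape)
```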