bert

🚀 Feature request

Fast Tokenizer for DeBERTa-V3 and mDeBERTa-V3

Motivation

DeBERTa V3 is an improved version of DeBERTa. With the V3 version, the authors also released a multilingual model, "mDeBERTa-base", that outperforms XLM-R-base. However, DeBERTa V3 currently lacks a FastTokenizer implementation, which makes it impossible to use with some of the example scripts (They require a Fa

The paper states:

Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence my
dog is hairy it chooses hairy.

This means that 15% of the tokens are certain to be chosen.

In https://github.com/codertimo/BERT-pytorch/blob/master/bert_pytorch/dataset/dataset.py#L68, however,
every single token independently has a 15% chance of going through the follow-up procedure.
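The difference between the two readings can be made concrete with a small sketch. It is not taken from the paper or from the BERT-pytorch repository: the function names `select_exact` and `select_independent` are hypothetical, and rounding the 15% quota to at least one position is an assumption made only for illustration.

```python
import random


def select_exact(tokens, ratio=0.15, rng=None):
    """Pick a fixed quota of distinct positions: round(ratio * n), at least 1.

    The rounding rule is an assumption for this sketch.
    """
    if rng is None:
        rng = random.Random()
    k = max(1, round(ratio * len(tokens)))
    return sorted(rng.sample(range(len(tokens)), k))


def select_independent(tokens, ratio=0.15, rng=None):
    """Give every position its own independent `ratio` chance of being picked."""
    if rng is None:
        rng = random.Random()
    return [i for i in range(len(tokens)) if rng.random() < ratio]


tokens = ["tok"] * 100
rng = random.Random(0)

# Collect how many positions each strategy selects over many trials.
exact_counts = {len(select_exact(tokens, rng=rng)) for _ in range(1000)}
indep_counts = {len(select_independent(tokens, rng=rng)) for _ in range(1000)}

print("fixed quota, distinct counts seen:", sorted(exact_counts))
print("independent draws, distinct counts seen:", sorted(indep_counts))
```

With a fixed quota the number of selected positions is the same on every call, whereas with independent draws it varies from call to call and is 15% only on average.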

_handle_duplicate_documents and _drop_duplicate_documents in the Elasticsearch document store always report self.index as the index with the conflict, which is incorrect.

Edit: Upon further investigation, this is actually a lot worse. Using multiple indices with the Elasticsearch document store is completely broken because this is used in `_handle_duplicate_do

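You are welcome to report issues with using PaddleNLP. Thank you very much for your contribution to PaddleNLP!
When filing your issue, please also provide the following information:

Version and environment information
1) PaddleNLP and PaddlePaddle versions: please provide your PaddleNLP and PaddlePaddle version numbers, e.g. PaddleNLP 2.0.4, PaddlePaddle 2.1.1
2) System environment: please describe the system type, e.g. Linux/Windows/MacOS, and the Python version
Reproduction information: for an error, please give the reproduction environment and reproduction steps
paddle version 2.0.8, paddlenlp version 2.1.0
Suggestion: could the PaddleNLP documentation list which type each model's tokenizer is based on (for example, the BERT tokenizer is WordPiece-based and the XLNet tokenizer is SentencePiece-based), together with corresponding input and output examples?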
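As an illustration of the kind of input/output example requested, here is a minimal sketch of WordPiece-style greedy longest-match-first splitting, the scheme used by the BERT tokenizer. It is a toy, not PaddleNLP's API: the function `wordpiece` and the vocabulary are made up for this example, and a SentencePiece tokenizer (as used by XLNet) works differently and is not shown.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Split one word into the longest vocabulary pieces, left to right.

    Pieces that do not start the word carry the continuation prefix "##".
    If some part of the word cannot be matched, the whole word becomes `unk`.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this sketch.
vocab = {"un", "##aff", "##able", "play", "##ing"}

print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece("playing", vocab))    # ['play', '##ing']
print(wordpiece("xyz", vocab))        # ['[UNK]']
```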

Here are 1,992 public repositories matching this topic...

huggingface / transformers

graykode / nlp-tutorial

hanxiao / bert-as-service

brightmart / nlp_chinese_corpus

ymcui / Chinese-BERT-wwm

huggingface / tokenizers

PaddlePaddle / ERNIE

codertimo / BERT-pytorch

macanv / BERT-BiLSTM-CRF-NER

deepset-ai / haystack

brightmart / albert_zh

jessevig / bertviz

bentrevett / pytorch-sentiment-analysis

shibing624 / pycorrector

IntelLabs / nlp-architect

PaddlePaddle / PaddleNLP

JohnSnowLabs / spark-nlp

CLUEbenchmark / CLUE

CyberZHG / keras-bert

BrikerMan / Kashgari

asyml / texar

Separius / awesome-sentence-embedding

brightmart / roberta_zh

km1994 / nlp_paper_study

bytedance / lightseq

namisan / mt-dnn

dbiir / UER-py

MaartenGr / BERTopic

Jiakui / awesome-bert

utterworks / fast-bert
