NLP Star Sky Intelligent Dialogue Robot Series: In-Depth Understanding of Transformers for Natural Language Processing, KantaiBERT ByteLevelBPETokenizer

Step 3: Training a tokenizer

This article does not use a pretrained tokenizer, such as the pretrained GPT-2 tokenizer, but trains a tokenizer from scratch. Hugging Face's ByteLevelBPETokenizer() will be trained on kant.txt. A byte-level tokenizer breaks strings or words down into substrings or subwords.

The ByteLevelBPETokenizer has two main advantages:

  • A compressed vocabulary: the tokenizer can break words down into minimal components and then merge the components that are statistically interesting. For example, "smaller" and "smallest" can become "small", "er", and "est"; the tokenizer can go further and obtain, for instance, "sm" and "all". In any case, words are broken down into subword tokens and smaller subword units rather than stored whole (see the sketch after this list).
  • Fewer out-of-vocabulary (OOV) problems: with byte-level encoding, the chunks of strings that a word-level tokenizer would have to classify as unk_token tokens practically disappear.
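
To make this concrete, here is a minimal, self-contained sketch (not part of the original text) that trains a tiny byte-level BPE tokenizer on a toy corpus and shows how "smaller" and "smallest" split into subwords. It assumes a tokenizers version that provides train_from_iterator, and the exact splits depend on the corpus and vocabulary size.

# Toy illustration of byte-level BPE subword splitting.
# Assumption: tokenizers provides ByteLevelBPETokenizer.train_from_iterator.
from tokenizers import ByteLevelBPETokenizer

corpus = ["small smaller smallest", "the smallest step is smaller than the small one"]
toy_tokenizer = ByteLevelBPETokenizer()
toy_tokenizer.train_from_iterator(corpus, vocab_size=300, min_frequency=1)

# Frequent stems such as "small" tend to become single subwords, while the
# rarer endings are split off ("er", "est"); the exact splits are data-dependent.
print(toy_tokenizer.encode("smaller smallest").tokens)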

In this model, we will train the tokenizer with the following parameters:

  • files=paths is the path to the dataset.
  • vocab_size=52_000 is the size of our tokenizer's vocabulary.
  • min_frequency=2 is the minimum frequency threshold.
  • special_tokens=[] is the list of special tokens.

In this case, the list of special tokens is:

  • <s>: the start token
  • <pad>: the padding token
  • </s>: the end token
  • <unk>: the unknown token
  • <mask>: the mask token for masked language modeling

Training the tokenizer generates merged substring tokens and analyzes their frequencies.
Take these two words in the middle of a sentence:

...the tokenizer...

The first step is to mark the string:

'Ġthe', 'Ġtoken', 'izer',

The strings are now tokenized, with Ġ marking the whitespace information.

The next step is to replace them with their indices; a quick check of this is sketched after the training code below.

The tokenizer training code is as follows:

#@title Step 3: Training a Tokenizer
#%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
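
Once the cell above has run, the freshly trained tokenizer can be inspected directly. The following is a minimal sketch (not part of the original code); the exact subword splits and ids depend on the training corpus, so the values are illustrative.

# Sketch: inspect the tokenizer trained above (values depend on the corpus).
encoding = tokenizer.encode(" the tokenizer")  # leading space reproduces the Ġ prefix
print(encoding.tokens)   # subwords such as 'Ġthe', 'Ġtoken', 'izer'
print(encoding.ids)      # the integer indices that replace the subwords

# The special tokens passed to train() occupy the first vocabulary slots:
for token in ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]:
    print(token, tokenizer.token_to_id(token))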

Step 4: Saving the files to disk

During training, the tokenizer generates two files:

  • merges.txt, which contains the merge rules for the tokenized substrings
  • vocab.json, which maps the tokenized substrings to their indices

The code first creates the KantaiBERT directory and then saves the two files:

#@title Step 4: Saving the files to disk
import os
token_dir = '/Chapter03/KantaiBERT'
if not os.path.exists(token_dir):
  os.makedirs(token_dir)
# save_model() returns the paths of the two files it writes:
tokenizer.save_model(token_dir)
['/Chapter03/KantaiBERT/vocab.json', '/Chapter03/KantaiBERT/merges.txt']


This example was tested in a Geek Cloud GPU environment.

The content of the config.json file is as follows:

{
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.10.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
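
The configuration above describes a small RoBERTa-style model: 6 hidden layers, 12 attention heads, a hidden size of 768, and a vocabulary of 52,000 tokens. As a hedged sketch (not taken from the original code), a matching configuration could be built with the transformers library's RobertaConfig class:

# Illustrative sketch, assuming the transformers library is installed.
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
print(config)  # the remaining fields keep their RoBERTa defaults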

File format and content of merges.txt

#version: 0.2 - Trained by `huggingface/tokenizers`
Ġ t
h e
Ġ a
o n
i n
Ġ o
Ġt he
r e
i t
Ġo f
i s
e n
a t
Ġ w
e r
Ġ c
i on
Ġ s
Ġ p
a l
i c
Ġ in
n d
Ġ b
e s
Ġt h
o r
Ġt o
Ġ m
c e
c t
Ġ f
in g
a n
e d
Ġ n
Ġ e
s e
o u
Ġ is
l y
Ġ Ġ
Ġ re
l e
Ġa nd
Ġ it
Ġc on
Ġ d
Ġb e
s t
o m
en t
Ġw h
o t
a r
a s
u t
ic h
e ct
Ġ h
Ġp r
Ġa s
l l
Ġwh ich
Ġth at
t he
Ġ u
i v
i m
it ion
at ion
o s
it y
Ġa n
Ġe x
Ġ on
Ġ l
ic al
Ġn ot
ĠĠ ĠĠ
p t
Ġth is
j ect
o w
es s
Ġw e
it h
Ġf or
i d
b ject
Ġs u
i b
en ce
ce pt
t er
u re
l d
i r
Ġw ith
Ġ g
t h
u r
c i
Ġ I
v er
q u
o g
re s
iv e
cept ion
is t
as on
Ġo r
a y
e re
Ġc an
g h
a in
Ġb y
e c
i f
a nd
se l
p er
â Ģ
s ib
Ġa ll
r om
u l
Ġpr o
e m
p p
Ġre ason
Ġo bject
Ġ T
Ġa re
v e
at e
c on
Ġf rom
c h
Ġd e
en d
u st
o re
Ġin t
p le
Ġb ut
i l
the r
a ct
Ġa l
he n
u s
Ġit s
os sib
Ġh a
Ġcon ception
i ll
Ġon ly
a w
a m
p l
Ġu n
ou ld
d ition
a b
nd er
âĢ Ķ
ou gh
sel f
u m
in ci
ĠĠĠĠ ĠĠĠĠ
Ġa t
o l
d e
Ġc a
Ġ (
i g
an s
t ain
Ġo ur
Ġthe re
m in
Ġl aw
i es
ter min
Ġm ust
Ġs p
Ġc om
t r
n ot
o f
c c
al ity
Ġp ossib
Ġpr inci
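
Each line after the #version header is one merge rule: the two symbols on the line are joined into a single token, and rules that appear earlier were learned earlier and have higher priority. A small sketch for reading the rules, assuming the token_dir path from Step 4:

# Sketch: read the BPE merge rules saved by save_model().
import os

with open(os.path.join(token_dir, 'merges.txt'), encoding='utf-8') as f:
    lines = f.read().splitlines()

rules = [tuple(line.split()) for line in lines if line and not line.startswith('#')]
for left, right in rules[:10]:
    print(f"{left} + {right} -> {left + right}")   # e.g. Ġ + t -> Ġt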

File format and content of vocab.json, which maps each token string to its integer id (excerpt)

0:20
1:21
2:22
3:23
4:24
5:25
6:26
7:27
8:28
9:29
10:8325
11:8326
12:9830
13:7336
14:9831
15:6113
16:6114
17:7337
18:7338
19:9832
20:8327
21:9834
22:6672
23:9835
24:9836
25:9837
26:9838
27:9839
28:9840
29:9841
30:9842
31:9843
32:9844
33:9845
34:9846
35:9847
36:9848
37:9849
38:9850
39:8328
40:7340
41:5729
42:9851
43:9852
44:9853
45:8329
46:9854
47:9855
48:9856
49:9857
50:6673
51:9859
52:9860
53:9861
54:9862
55:8330
56:9863
57:8331
58:9864
59:9865
60:9866
61:9867
62:9868
63:9869
64:9870
65:9871
66:9872
67:9873
68:5069
69:9874
70:9875
71:6674
72:9876
73:9877
74:9878
75:8334
76:9879
77:9880
78:4809
79:8335
80:4390
81:9881
83:12580
84:12581
87:9882
96:9883
100:12577
175:17600
178:9833
280:7339
501:12041
568:8332
622:8333
655:12578
712:17440
775:12579
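
Because vocab.json is just a token-to-id mapping, it can also be inverted to read encoded ids back as token strings. A brief sketch, assuming the token_dir path from Step 4 and the tokenizer trained above:

# Sketch: invert vocab.json (token -> id) into an id -> token lookup.
import json
import os

with open(os.path.join(token_dir, 'vocab.json'), encoding='utf-8') as f:
    vocab = json.load(f)

id_to_token = {i: t for t, i in vocab.items()}
ids = tokenizer.encode(" the tokenizer").ids
print([id_to_token[i] for i in ids])   # recovers the subword strings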

Step 5: Loading the trained tokenizer files

  • We could load pretrained tokenizer files.
  • But since we trained the tokenizer ourselves, we now load our own files:

#@title Step 5 Loading the Trained Tokenizer Files 
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./KantaiBERT/vocab.json",
    "./KantaiBERT/merges.txt",
)

The tokenizer can encode a sequence:

tokenizer.encode("The Critique of Pure Reason.").tokens

The operation results are as follows:
['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']

We can also ask to see the number of tokens in this sequence:

tokenizer.encode("The Critique of Pure Reason.")

The output shows that there are 6 tokens in the sequence:

Encoding(num_tokens=6, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
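
The Encoding object returned by encode() exposes the attributes listed in that output. A brief sketch (not in the original excerpt):

# Sketch: a few of the Encoding attributes.
enc = tokenizer.encode("The Critique of Pure Reason.")
print(enc.ids)             # the integer indices of the 6 tokens
print(enc.attention_mask)  # [1, 1, 1, 1, 1, 1]
print(enc.offsets)         # character span of each token in the input string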

The tokenizer now processes the tokens to fit the BERT model variant we will use; for example, the post-processor adds the start and end tokens:

tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

If we encode the sequence again after post-processing, the start and end tokens are added and there are 8 tokens, as sketched below.
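
A minimal sketch of that check (not in the original excerpt), reusing the tokenizer configured above; the middle token splits depend on the trained vocabulary:

# Sketch: re-encode after enabling the BERT-style post-processor.
encoding = tokenizer.encode("The Critique of Pure Reason.")
print(encoding.tokens)
# Expected shape: ['<s>', 'The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.', '</s>']
print(len(encoding.tokens))  # 8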

Step 6: Checking resource constraints: GPU and CUDA

KantaiBERT runs at optimum speed using a graphics processing unit (GPU).

First, run the following command to check whether an NVIDIA GPU card is available:

#@title Step 6: Checking Resource Constraints: GPU and NVIDIA 
!nvidia-smi
Fri Sep 17 13:15:11 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:03:00.0 Off |                  N/A |
| 29%   31C    P8     8W / 250W |      1MiB / 11177MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:81:00.0 Off |                  N/A |
| 29%   37C    P8     8W / 250W |      1MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Check to make sure PyTorch sees CUDA:

#@title Checking that PyTorch Sees CUDA
import torch
torch.cuda.is_available()
True
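
As a follow-up, a small sketch (not part of the original code) of the usual device-selection pattern, so later cells can run on the GPU when one is available and fall back to the CPU otherwise:

# Sketch: pick the device once and reuse it for the model and tensors.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)  # "cuda" on the machine shown above, "cpu" otherwise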
