Walkthrough and Debugging of the Official Text Generation Tutorial

Official RNN tutorial: https://www.tensorflow.org/tutorials/text/text_generation

The dataset is taken from the collected works of Shakespeare. Resource link: https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt

Basic environment:
Python 3, TensorFlow, Keras.

Imports:

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import numpy as np
import os
import time

Download and read the dataset:

path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
# Read, then decode for py2 compat.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))
# Take a look at the first 250 characters in text
print(text[:250])
# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

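Because the dataset is downloaded over HTTPS, the official code can fail at this step.

The fix is to import the ssl module and turn off certificate verification before the download. Note that this switches off certificate checks for HTTPS requests made through urllib in the whole process, so it is best kept to a local experiment.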

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import ssl
import numpy as np
import os
import time
ssl._create_default_https_context = ssl._create_unverified_context
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

The download now succeeds.

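In NLP text experiments, the usual first step is to strip special characters, digits, and similar noise from the text.
The official tutorial instead takes a less common, character-level approach. It keeps the relationship between special characters and the surrounding text, but it does not scale well to large corpora and is rarely used in practice.
The method first builds two lookup tables: one mapping characters to numbers and one mapping numbers back to characters.
All processing is then done on the numbers, and the lookup table restores the text afterwards.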

# vocab holds the distinct characters of the text in sorted order: set(text) collects the unique characters and sorted() orders them.
vocab = sorted(set(text))
# Create the character-to-index mapping (char2idx) and the index-to-character mapping (idx2char)
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])

First, let's look at what vocab contains.

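The output shows the trade-off of this approach: splitting the text into characters can break the relationships between words, although the ordering of characters within the sequence is emphasised more strongly.
char2idx represents each character as a number, while idx2char is simply the vocabulary stored as an array, so indexing it with a number returns the character.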
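
To make the two lookup tables concrete, here is a minimal sketch on a short sample string. The string `sample` is an assumption chosen for illustration; the other variable names follow the code above.

```python
import numpy as np

# Toy text standing in for the Shakespeare corpus (illustrative assumption).
sample = "hello world"

# Sorted list of the distinct characters in the text.
vocab = sorted(set(sample))

# Character -> index lookup (dict), and index -> character lookup (NumPy array).
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = np.array(vocab)

# Encode the whole text as integers, then decode it back.
text_as_int = np.array([char2idx[c] for c in sample])
decoded = "".join(idx2char[text_as_int])

print(vocab)
print(text_as_int)
print(decoded)
```

Because idx2char is a NumPy array, it can be indexed with a whole array of integers at once, which is what the later `idx2char[item.numpy()]` calls rely on.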

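Next, let's look at the prediction task in the official tutorial.
The tutorial first defines the task: given a sequence of characters, train a model to predict the next character at each time step.
Because an RNN maintains an internal state that depends on the elements seen so far, every later output is predicted from the current input combined with all of the preceding inputs.
To make use of this state, we first need to build the training set.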
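
The role of the internal state can be illustrated with a minimal NumPy sketch of a plain recurrent step, h_t = tanh(W_x x_t + W_h h_{t-1} + b). This is not the model used in the official tutorial; the sizes, the random weights, and the helper name `rnn_forward` are assumptions for illustration only.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a plain RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_h.shape[0])          # initial state: no context yet
    states = []
    for x_t in inputs:                  # one time step per input vector
        # The new state mixes the current input with the previous state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 3          # hypothetical sizes
W_x = rng.normal(scale=0.5, size=(hidden_size, vocab_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

# Two sequences that share their last two characters but differ in the first.
eye = np.eye(vocab_size)                # one-hot vectors for character indices
seq_a = eye[[0, 1, 2]]
seq_b = eye[[3, 1, 2]]

final_a = rnn_forward(seq_a, W_x, W_h, b)[-1]
final_b = rnn_forward(seq_b, W_x, W_h, b)[-1]
print(final_a)
print(final_b)
```

The two final states differ even though the last two inputs are identical, because the state still carries a trace of the first character.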

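The training set covers the whole text, converted into index sequences with the mapping above.
Each input sequence has a matching target sequence: the same text shifted one character to the right. The target characters are what the predictions are compared against and corrected by.
For example, take a sequence length of 4 and the input text "hellow". In the first step the input is "hell" and the shifted target is "ello"; the RNN produces an output, which is compared with the target and corrected. In the second step the input is "ello" and the target is "llow", and this step also draws on the intermediate information from all earlier steps, which can be called the RNN's intermediate state.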
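
The shifting described in this example can be sketched in plain Python. The string "hellow" and the helper name `shifted_pairs` are assumptions for illustration. This sketch slides the window one character at a time, as in the description; the tutorial code further down cuts the text into non-overlapping chunks instead.

```python
def shifted_pairs(text, seq_length):
    """Yield (input, target) pairs where target is input shifted right by one character."""
    for start in range(len(text) - seq_length):
        chunk = text[start:start + seq_length + 1]   # seq_length + 1 characters
        yield chunk[:-1], chunk[1:]                  # drop last char / drop first char

pairs = list(shifted_pairs("hellow", seq_length=4))
for input_text, target_text in pairs:
    print(input_text, "->", target_text)
```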
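From the official RNN model we can see that, when analysing the order of the input text, a plain RNN can only pick up patterns from the preceding context and cannot extract features running from back to front. Later models such as RNN-LSTM and seq2seq improve on this, which I may cover another time.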

Now let's see how an RNN reads the collected works of Shakespeare and writes a passage in a similar style.

# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))


def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))


for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))
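
For readers without TensorFlow at hand, the chunking done by `batch(seq_length+1, drop_remainder=True)` followed by `split_input_target` can be mimicked in plain Python. This is only a sketch of the behaviour on a toy string; the helper name `make_examples` and the sample text are assumptions, and it does not use the tf.data API.

```python
def make_examples(text, seq_length):
    """Cut text into non-overlapping chunks of seq_length + 1 characters,
    drop any incomplete chunk at the end, and split each chunk into
    an (input, target) pair."""
    chunk_size = seq_length + 1
    n_chunks = len(text) // chunk_size       # same formula as examples_per_epoch
    examples = []
    for k in range(n_chunks):
        chunk = text[k * chunk_size:(k + 1) * chunk_size]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

examples = make_examples("First Citizen", seq_length=4)
for input_text, target_text in examples:
    print(repr(input_text), "->", repr(target_text))
```

The 13-character sample yields two full chunks of 5 characters; the leftover "zen" is discarded, just as `drop_remainder=True` discards an incomplete final batch.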
