Tutorial on Commonly Used Python Deep Learning Packages ---- Adam Studio

Python Deep Learning Packages


State of open source deep learning frameworks

Keras [11]
Well known for being minimalistic, the Keras neural network library (with a supporting Python interface) supports both convolutional and recurrent networks, capable of running on either TensorFlow or Theano. The library is written in Python and was developed with quick experimentation as its unique selling point.
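
To give a flavour of that minimalism, here is a minimal sketch of a Keras model definition (an illustration with assumed input shape and layer sizes, not code used elsewhere in this kernel):

from keras.models import Sequential
from keras.layers import Dense

# a tiny fully connected binary classifier
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(100,)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()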

TensorFlow
TensorFlow is arguably one of the best deep learning frameworks and has been adopted by several giants such as Airbus, Twitter, IBM, and others mainly due to its highly flexible system architecture.

Caffe
Caffe is a deep learning framework that is supported with interfaces like C, C++, Python, and MATLAB as well as the command line interface. It is well known for its speed and transposability and for its applicability in modeling convolutional neural networks (CNNs).

Microsoft Cognitive Toolkit/CNTK
Popularly known for easy training and the combination of popular model types across servers, the Microsoft Cognitive Toolkit (previously known as CNTK) is an open-source deep learning framework for training deep learning models. It performs efficient convolutional neural network training for image, speech, and text-based data. Similar to Caffe, it is supported by interfaces such as Python, C++, and the command line interface.

Torch/PyTorch
Torch is a scientific computing framework that offers wide support for machine learning algorithms. It is a Lua-based deep learning framework used widely by industry giants such as Facebook, Twitter, and Google. It employs CUDA along with C/C++ libraries for processing and was built to scale model building for production while providing overall flexibility. PyTorch exposes the same back end through Python, building its computation graphs dynamically at runtime.
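
As a minimal sketch of that dynamic, define-by-run style in PyTorch (an illustration only, with an arbitrary tensor shape):

import torch

# the graph is recorded as the code runs
x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()
y.backward()      # gradients for the graph defined at runtime
print(x.grad)     # equals 2 * x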

MXNet
Designed specifically for the purpose of high efficiency, productivity, and flexibility, MXNet (pronounced as mix-net) is a deep learning framework supported by Python, R, C++, and Julia.

Chainer
Highly powerful, dynamic and intuitive, Chainer is a Python-based deep learning framework for neural networks that is designed around the define-by-run strategy. Unlike frameworks that use a static define-and-run strategy, it lets you modify the network during runtime, allowing you to execute arbitrary control flow statements.

Deeplearning4j
Parallel training through iterative reduce, microservice architecture adaptation, and distributed CPUs and GPUs are some of the salient features of the Deeplearning4j deep learning framework. It is developed in Java as well as Scala and supports other JVM languages, too.

Theano
Theano is beautiful. Without Theano, we wouldn’t have anywhere near the amount of deep learning libraries (specifically in Python) that we do today. In the same way that without NumPy, we couldn’t have SciPy, scikit-learn, and scikit-image, the same can be said about Theano and higher-level abstractions of deep learning.

Lasagne
Lasagne is a lightweight library used to construct and train networks in Theano. The key term here is lightweight — it is not meant to be a heavy wrapper around Theano like Keras is. While this leads to your code being more verbose, it does free you from any restraints, while still giving you modular building blocks based on Theano.

PaddlePaddle
PaddlePaddle (PArallel Distributed Deep LEarning) is an easy-to-use, efficient, flexible and scalable deep learning platform, originally developed by Baidu scientists and engineers for the purpose of applying deep learning to many products at Baidu.

Import

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from wordcloud import WordCloud as wc
from nltk.corpus import stopwords
import matplotlib.pylab as pylab
import matplotlib.pyplot as plt
from pandas import get_dummies
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import string
import scipy
import numpy
import nltk
import json
import sys
import csv
import os

Version

print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))


Setup

A few tiny adjustments for better code readability

sns.set(style='white', context='notebook', palette='deep')
pylab.rcParams['figure.figsize'] = 12,8
warnings.filterwarnings('ignore')
mpl.style.use('ggplot')
sns.set_style('white')
%matplotlib inline

NLTK

In this kernel we use the NLTK library, so before we begin the next step we will first introduce it. The Natural Language Toolkit (NLTK) is one of the leading platforms for working with human language data in Python, and it is used here for natural language processing. NLTK is literally an acronym for Natural Language Toolkit. With it you can tokenize words and sentences, and mine and analyse very large amounts of textual data using computational methods.


from nltk.tokenize import sent_tokenize, word_tokenize
 
data = "All work and no play makes jack a dull boy, all work and no play"
print(word_tokenize(data))


All of them are words except the comma. Special characters are treated as separate tokens.

Tokenizing sentences

The same principle can be applied to sentences; simply change word_tokenize() to sent_tokenize(). We have added two sentences to the variable data:

from nltk.tokenize import sent_tokenize, word_tokenize
 
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
print(sent_tokenize(data))


NLTK and arrays

If you wish, you can store the words and sentences in lists:

from nltk.tokenize import sent_tokenize, word_tokenize
 
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
 
phrases = sent_tokenize(data)
words = word_tokenize(data)
 
print(phrases)
print(words)


NLTK stop words

Stop words are basically a set of commonly used words in any language, not just English. The reason why stop words are critical to many applications is that, if we remove the words that are very commonly used in a given language, we can focus on the important words instead.[12]

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
 
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []
 
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
 
print(wordsFiltered)


A module has been imported:

from nltk.corpus import stopwords

We get a set of English stop words using the line:

stopWords = set(stopwords.words('english'))

The returned set stopWords contains 153 stop words on my computer. You can view its length or contents with the lines:

print(len(stopWords))
print(stopWords)


We create a new list called wordsFiltered which contains all words that are not stop words. To create it we iterate over the list of words and only add a word if it is not in the stopWords set.

for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)

NLTK – stemming

Start by defining some words:

words = ["game","gaming","gamed","games"]
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

And stem the words in the list using:

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

words = ["game","gaming","gamed","games"]
ps = PorterStemmer()
 
for word in words:
    print(ps.stem(word))


NLTK speech tagging

NLTK can automatically tag parts of speech. Given a sentence or paragraph, it can label words as verbs, nouns, and so on.

NLTK – speech tagging example

The example below automatically tags words with their corresponding part-of-speech class.

import nltk
from nltk.tokenize import PunktSentenceTokenizer
 
document = 'Whether you\'re new to programming or an experienced developer, it\'s easy to learn and use Python.'
sentences = nltk.sent_tokenize(document)   
for sent in sentences:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))

We can filter this data based on the type of word:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
 
document = 'Today the Netherlands celebrates King\'s Day. To honor this tradition, the Dutch embassy in San Francisco invited me to'
sentences = nltk.sent_tokenize(document)   
 
data = []
for sent in sentences:
    data = data + nltk.pos_tag(nltk.word_tokenize(sent))
 
for word in data: 
    if 'NNP' in word[1]: 
        print(word)



Natural Language Processing – prediction

We can use natural language processing to make predictions. For example, given a product review, a computer can predict whether it's positive or negative based on the text. In this article you will learn how to make a prediction program based on natural language processing.

NLP prediction example

Given a name, the classifier will predict if it’s a male or female.

To create our analysis program, we have several steps:

  • Data preparation
  • Feature extraction
  • Training
  • Prediction

Data preparation: The first step is to prepare data. We use the names corpus included with nltk.

from nltk.corpus import names
 
# Load data and training 
names = ([(name, 'male') for name in names.words('male.txt')] + 
	 [(name, 'female') for name in names.words('female.txt')])

This dataset is simply a collection of tuples. To give you an idea of what the dataset looks like:

[(u'Aaron', 'male'), (u'Abbey', 'male'), (u'Abbie', 'male')]
[(u'Zorana', 'female'), (u'Zorina', 'female'), (u'Zorine', 'female')]


You can define your own set of tuples if you wish; it's simply a list containing many tuples.

Feature extraction: Based on the dataset, we prepare our feature. The feature we will use is the last letter of a name. The feature (last letter) is extracted using:

def gender_features(word): 
    return {'last_letter': word[-1]}

and we build the feature set with:

featuresets = [(gender_features(n), g) for (n,g) in names]

Training and prediction: We train and predict using:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
 
def gender_features(word): 
    return {'last_letter': word[-1]} 
 
# Load data and training 
names = ([(name, 'male') for name in names.words('male.txt')] + 
	 [(name, 'female') for name in names.words('female.txt')])
 
featuresets = [(gender_features(n), g) for (n,g) in names] 
train_set = featuresets
classifier = nltk.NaiveBayesClassifier.train(train_set) 
 
# Predict
print(classifier.classify(gender_features('Frank')))


If you want to give the name during runtime, change the last line to:

# Predict, you can change name
name = 'Sarah'
print(classifier.classify(gender_features(name)))
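
The code above trains and predicts on the full name list. As a quick sketch, you could also hold out some of the names and measure accuracy with nltk.classify.util.accuracy, which was imported above but not used so far:

import random
import nltk.classify.util

# hold out 500 names for evaluation
random.shuffle(featuresets)
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print('accuracy:', nltk.classify.util.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)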


EDA

In this section, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data.

  • Which variables suggest interesting relationships?
  • Which observations are unusual?
  • Analysis of the features!

By the end of the section, you’ll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. Then we will review analytical and statistical operations:

  • Data Collection
  • Visualization
  • Data Cleaning
  • Data Preprocessing

Data Collection

Data collection is the process of gathering and measuring data, information or any variables of interest in a standardized and established manner that enables the collector to answer or test hypotheses and evaluate outcomes of the particular collection.[techopedia]

I start data collection by loading the training and testing datasets into Pandas DataFrames.

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

<< Note 1 >>

  • Each row is an observation (also known as: sample, example, instance, record).
  • Each column is a feature (also known as: predictor, attribute, independent variable, input, regressor, covariate).

train.sample(1) 


test.sample(1) 


Or you can use other commands to explore the dataset, such as:

train.tail(1)


Features

Features can be of the following types:

  • numeric
  • categorical
  • ordinal
  • datetime
  • coordinates

Can you find the type of each feature in the Quora dataset?

To get some information about the dataset you can use the info() command.

print(train.info())


print(test.info())


Explore the Dataset

1- Dimensions of the dataset.

2- Peek at the data itself.

3- Statistical summary of all attributes.

4- Breakdown of the data by the class variable.

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

# shape for train and test
print('Shape of train:',train.shape)
print('Shape of test:',test.shape)


#columns*rows
train.size


After loading the data via pandas, we should check out its type, content, and statistical description via the following:

type(train)


type(test)


train.describe()


To pull up 5 random rows from the data set, we can use the sample(5) function and inspect the type of each feature.

train.sample(5) 


Data Cleaning

When dealing with real-world data, dirty data is the norm rather than the exception. We continuously need to predict correct values, impute missing ones, and find links between various data artefacts such as schemas and records. We need to stop treating data cleaning as a piecemeal exercise (resolving different types of errors in isolation), and instead leverage all signals and resources (such as constraints, available statistics, and dictionaries) to accurately predict corrective actions.

The primary goal of data cleaning is to detect and remove errors and anomalies to increase the value of data in analytics and decision making. While it has been the focus of many researchers for several years, individual problems have been addressed separately. These include missing value imputation, outliers detection, transformations, integrity constraints violations detection and repair, consistent query answering, deduplication, and many other related problems such as profiling and constraints mining.[4]

How many NA elements are in each column?

Good news, it is zero!

To check how many null values there are in the dataset, we can use isnull().sum().

train.isnull().sum()


But if we had any, we could just use dropna() (be careful, sometimes you should not do this!):

# remove rows that have NA's
print('Before Dropping',train.shape)
train = train.dropna()
print('After Dropping',train.shape)
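
If the dataset did contain missing values, a gentler alternative to dropping rows is imputation. Here is a minimal sketch, assuming we prefer to keep every row (the numeric column in the comment is hypothetical):

# fill missing questions with an empty string instead of dropping the rows
train['question_text'] = train['question_text'].fillna("")
# for a numeric column, the column mean is a common choice:
# train['some_numeric_col'] = train['some_numeric_col'].fillna(train['some_numeric_col'].mean())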


We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

To print the dataset columns, we can use the columns attribute.

train.columns

You can see the number of unique values of the target with the commands below:

train_target = train['target'].values

np.unique(train_target)

Yes, the Quora problem is a binary classification!

To check the first 5 rows of the data set, we can use head(5).

train.head(5) 

Or to check out the last 5 rows of the data set, we use the tail() function.

train.tail() 

To give a statistical summary about the dataset, we can use describe()

train.describe() 


As you can see, the statistical information this command gives us is not very useful for this type of data; describe() is more suited to numerical data sets.

Data Preprocessing

Data preprocessing refers to the transformations applied to our data before feeding it to the algorithm.

Data Preprocessing is a technique that is used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources it is collected in a raw format which is not feasible for analysis. There are plenty of steps for data preprocessing, and we list some of them here in general terms (not just for Quora):

  • removing Target column (id)
  • Sampling (without replacement)
  • Making part of the data unbalanced and balancing it (with undersampling and SMOTE)
  • Introducing missing values and treating them (replacing by average values)
  • Noise filtering
  • Data discretization
  • Normalization and standardization
  • PCA analysis
  • Feature selection (filter, embedded, wrapper)
  • Etc.

Which of these preprocessing methods can we run on Quora? As one example, standardization of numeric features is sketched below.
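
A minimal sketch of the normalization/standardization step with scikit-learn, assuming the numeric columns num_words and num_chars already exist (they are engineered later in this kernel):

from sklearn.preprocessing import StandardScaler

# standardize the (assumed) engineered numeric features on a copy of the data
numeric_cols = ['num_words', 'num_chars']
scaler = StandardScaler()
train_scaled = train.copy()
train_scaled[numeric_cols] = scaler.fit_transform(train_scaled[numeric_cols])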

<< Note 2 >> In a pandas data frame you can perform queries such as "where":

train.where(train ['target']==1).count()

As you can see below, in Python it is easy to perform queries on the dataframe:

train[train['target']>1]


Some examples of questions that are insincere:

train[train['target']==1].head(5)


Is the data set imbalanced?

train_target.mean()


The data is heavily unbalanced, but how can we deal with that?

train["target"].value_counts()
# the data is imbalanced


Imbalanced dataset is relevant primarily in the context of supervised machine learning involving two or more classes.

Imbalance means that the number of data points available for the different classes is different: if there are two classes, then balanced data would mean 50% of the points for each class. For most machine learning techniques, a little imbalance is not a problem. So, if there are 60% of points for one class and 40% for the other class, it should not cause any significant performance degradation. Only when the class imbalance is high, e.g. 90% of points for one class and 10% for the other, do standard optimization criteria or performance measures become less effective and need modification.

A typical example of imbalanced data is encountered in e-mail classification problem where emails are classified into ham or spam. The number of spam emails is usually lower than the number of relevant (ham) emails. So, using the original distribution of two classes leads to imbalanced dataset.

Using accuracy as a performance measure for highly imbalanced datasets is not a good idea. For example, if 90% of points belong to the true class in a binary classification problem, a default prediction of true for all data points leads to a classifier which is 90% accurate, even though the classifier has not learnt anything about the classification problem at hand![9]
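
To make this concrete on the Quora data, here is a minimal sketch (an illustration only, not applied in the rest of this kernel) of the accuracy trap and of one possible remedy, random undersampling of the majority class:

from sklearn.metrics import accuracy_score, f1_score

# predicting the majority class (0, sincere) for every question looks accurate...
majority_pred = np.zeros_like(train_target)
print('accuracy:', accuracy_score(train_target, majority_pred))
print('f1      :', f1_score(train_target, majority_pred))   # ...but F1 collapses to 0

# random undersampling: keep all insincere questions plus an equal-sized sample of sincere ones
insincere = train[train['target'] == 1]
sincere = train[train['target'] == 0].sample(len(insincere), random_state=0)
train_balanced = pd.concat([insincere, sincere]).sample(frac=1, random_state=0)
print(train_balanced['target'].value_counts())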

Exploring Questions

question = train['question_text']
i=0
for q in question[:5]:
    i=i+1
    print('sample '+str(i)+':' ,q)


# remove all digits from each question text
text_withnumber = train['question_text']
result = text_withnumber.apply(lambda x: ''.join(ch for ch in str(x) if not ch.isdigit()))

Some Feature Engineering

As noted above, NLTK (the Natural Language Toolkit) is one of the leading platforms for working with human language data in Python, and we use it here for natural language processing.

We get a set of English stop words using the line

#from nltk.corpus import stopwords
eng_stopwords = set(stopwords.words("english"))

The returned set eng_stopwords contains 179 stop words on my computer. You can view its length or contents with the lines:

print(len(eng_stopwords))
print(eng_stopwords)


Number of words in the text

train["num_words"] = train["question_text"].apply(lambda x: len(str(x).split()))
test["num_words"] = test["question_text"].apply(lambda x: len(str(x).split()))
print('maximum of num_words in train',train["num_words"].max())
print('min of num_words in train',train["num_words"].min())
print("maximum of  num_words in test",test["num_words"].max())
print('min of num_words in test',test["num_words"].min())


Number of unique words in the text

train["num_unique_words"] = train["question_text"].apply(lambda x: len(set(str(x).split())))
test["num_unique_words"] = test["question_text"].apply(lambda x: len(set(str(x).split())))
print('maximum of num_unique_words in train',train["num_unique_words"].max())
print('mean of num_unique_words in train',train["num_unique_words"].mean())
print("maximum of num_unique_words in test",test["num_unique_words"].max())
print('mean of num_unique_words in test',test["num_unique_words"].mean())


Number of characters in the text

train["num_chars"] = train["question_text"].apply(lambda x: len(str(x)))
test["num_chars"] = test["question_text"].apply(lambda x: len(str(x)))
print('maximum of num_chars in train',train["num_chars"].max())
print("maximum of num_chars in test",test["num_chars"].max())


Number of stopwords in the text

train["num_stopwords"] = train["question_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
test["num_stopwords"] = test["question_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
print('maximum of num_stopwords in train',train["num_stopwords"].max())
print("maximum of num_stopwords in test",test["num_stopwords"].max())

Number of punctuations in the text

train["num_punctuations"] =train['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
test["num_punctuations"] =test['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
print('maximum of num_punctuations in train',train["num_punctuations"].max())
print("maximum of num_punctuations in test",test["num_punctuations"].max())


Number of uppercase words in the text

train["num_words_upper"] = train["question_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
test["num_words_upper"] = test["question_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
print('maximum of num_words_upper in train',train["num_words_upper"].max())
print("maximum of num_words_upper in test",test["num_words_upper"].max())


Number of title case words in the text

train["num_words_title"] = train["question_text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
test["num_words_title"] = test["question_text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
print('maximum of num_words_title in train',train["num_words_title"].max())
print("maximum of num_words_title in test",test["num_words_title"].max())


Average length of the words in the text

train["mean_word_len"] = train["question_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test["mean_word_len"] = test["question_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
print('maximum of mean_word_len in train',train["mean_word_len"].max())
print("maximum of mean_word_len in test",test["mean_word_len"].max())


We have now added some new features to the train and test data sets, so let's print the columns again:

print(train.columns)
train.head(1)
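
Since seaborn and matplotlib were set up at the start of the kernel but have not produced a plot yet, here is a quick sketch (for illustration) of how one engineered feature separates the two classes:

# distribution of question length by target class
sns.boxplot(x='target', y='num_words', data=train)
plt.title('Number of words per question, by target')
plt.show()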


Preprocessing and generation pipelines depend on the model type.

What is a Tokenizer?

Tokenizing raw text data is an important pre-processing step for many NLP methods. As explained on wikipedia, tokenization is “the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.” In the context of actually working through an NLP analysis, this usually translates to converting a string like “My favorite color is blue” to a list or array like [“My”, “favorite”, “color”, “is”, “blue”].[11]

import nltk
mystring = "I love Kaggle"
mystring2 = "I'd love to participate in kaggle competitions."
nltk.word_tokenize(mystring)


nltk.word_tokenize(mystring2)


WordCloud

def generate_wordcloud(text): 
    # build a word cloud from the text, ignoring English stop words
    wordcloud = wc(relative_scaling=1.0, stopwords=eng_stopwords).generate(text)
    fig, ax = plt.subplots(1, 1, figsize=(10, 10))
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.axis("off")
    ax.margins(x=0, y=0)
    plt.show()

text = " ".join(train.question_text)
generate_wordcloud(text)
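
The same helper can be pointed at a subset of the data; for example (illustration only), a word cloud built from the insincere questions alone:

# word cloud of insincere (target == 1) questions only
insincere_text = " ".join(train[train['target'] == 1].question_text)
generate_wordcloud(insincere_text)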

