5个流行的自然语言处理库及入门用法

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文介绍了5个流行的Python NLP库和它们的入门用法,这些库涵盖了语言数据可视化、数据预处理、多任务功能、一流语言建模等用例。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文并不是要从这些解决方案中指定一个最优集合,而是给出一篇概述,介绍精选的5个流行库,希望能帮助解决你的问题。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1."},{"type":"link","attrs":{"href":"https:\/\/github.com\/huggingface\/datasets","title":null,"type":null},"content":[{"type":"text","text":"Hugging Face Datasets"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hugging Face的Datasets库本质上是一个对公开可用的NLP数据集的打包集合,带有一组通用的API和数据格式,以及一些辅助功能。以下是关于它的介绍:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"收集了最多的用于ML模型的即开即用NLP数据集,具有快速、易用且高效的数据操作工具。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你可以通过以下方式轻松安装Datasets库:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"pip install datasets"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根据介绍,Datasets提供了两大特性:"},{"type":"text","marks":[{"type":"strong"}],"text":"用于许多公共数据集的单行数据加载器,以及高效的数据预处理"},{"type":"text","text":"。但它的介绍没有提到这个库的另一大特性:与NLP任务相关的许多内置评估指标。这个库还有其他一些特性,例如数据集的后端内存管理以及与流行的Python工具(如NumPy、Pandas)和主流机器学习平台(TensorFlow和PyTorch)的互操作性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们先来看看如何加载一个数据集:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"from datasets import load_dataset, list_datasets\nprint(f\"The Hugging Face datasets library contains {len(list_datasets())} datasets\")\nsquad_dataset = load_dataset('squad')\nprint(squad_dataset['train'][0])\nprint(squad_dataset)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"The Hugging Face datasets library contains 635 datasets\nReusing dataset squad (\/home\/matt\/.cache\/huggingface\/datasets\/squad\/plain_text\/1.0.0\/4c81550d83a2ac7c7ce23783bd8ff36642800e6633c1f18417fb58c3ff50cdd7)\n{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f41900661182', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'}\nDatasetDict({\n train: Dataset({\n features: ['id', 'title', 'context', 'question', 'answers'],\n num_rows: 87599\n })\n validation: Dataset({\n features: ['id', 'title', 'context', 'question', 'answers'],\n num_rows: 10570\n })\n})"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"加载指标也很简单:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"from datasets import load_metric, list_metrics\nprint(f\"The Hugging Face datasets library contains {len(list_metrics())} metrics\")\nprint(f\"Available metrics are: {list_metrics()}\")\n# Load a metric\nsquad_metric = load_metric('squad')"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"The Hugging Face datasets library contains 19 metrics\nAvailable metrics are: ['accuracy', 'bertscore', 'bleu', 'bleurt', 'comet', 'coval', 'f1', 'gleu', 'glue', 'indic_glue', 'meteor', 'precision', 'recall', 'rouge', 'sacrebleu', 'seqeval', 'squad', 'squad_v2', 'xnli']"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你想用它们做什么都随意,但有了这个库,你就可以轻松加载可公开访问的数据集,和久经考验的真实评估指标了。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2."},{"type":"link","attrs":{"href":"https:\/\/github.com\/jbesomi\/texthero","title":null,"type":null},"content":[{"type":"text","text":"TextHero"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TextHero在其GitHub存储库中的介绍很简单:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"文本预处理、表示和可视化,助你从零迈向大师。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这几句话很好地解释了这个库可以解决的问题,下面我们再深入研究一下为什么我们就要用它。从repo中我们可以看到更具体的说明:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TextHero只有一个非常务实的目标:为开发人员腾出空闲时间。文本数据处理起来可能会很痛苦,在大多数情况下有一个默认管道的话上手起来就轻松多了。总有时间回来改进以前的工作。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"现在你知道了为什么要使用TextHero,它的安装方法如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"pip install texthero"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/texthero.org\/docs\/getting-started","title":null,"type":null},"content":[{"type":"text","text":"入门指南"}]},{"type":"text","text":"介绍了你可以用它做的工作,几行代码就可以搞定。使用TextHero Github存储库中的以下示例,我们会加载一个数据集,清理它,并创建一个TF-IDF表示,执行主成分分析(PCA)并绘制PCA的结果。"}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"def text_texthero():\n\timport texthero as hero\n\timport pandas as pd\n\tdf = pd.read_csv(\"https:\/\/github.com\/jbesomi\/texthero\/raw\/master\/dataset\/bbcsport.csv\")\n\tdf['pca'] = (\n\t\tdf['text']\n\t\t\t.pipe(hero.clean)\n\t\t\t.pipe(hero.tfidf)\n\t\t\t.pipe(hero.pca)\n\t\t)\n\thero.scatterplot(df, 'pca', color='topic', title=\"PCA BBC Sport news\")"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d9\/1b\/d9002dbb81fda60d82d3a3e765107f1b.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用TextHero可以完成的工作还有很多,请继续查阅"},{"type":"link","attrs":{"href":"https:\/\/texthero.org\/docs\/api-preprocessing","title":null,"type":null},"content":[{"type":"text","text":"文档"}]},{"type":"text","text":",了解数据清理和预处理、可视化、表示、基本NLP任务等相关信息。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3."},{"type":"link","attrs":{"href":"https:\/\/spacy.io\/","title":null,"type":null},"content":[{"type":"text","text":"spaCy"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spaCy是专门设计的,其宗旨是成为一个用于实现生产就绪系统的有用库。以下是关于它的介绍:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spaCy旨在帮助你完成真正的工作——构建真正的产品,或收集真正的见解。这个库尊重你的宝贵时间,并尽量避免浪费它。它很容易安装,其API简单而高效。我们愿意将spaCy视为自然语言处理领域的Ruby on Rails。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以当你准备开始做一些真正的工作时,你需要先安装spaCy和至少一个语言模型。在下面这个例子中我们将使用它的英语语言模型。库和语言模型只需几行代码即可安装:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"pip install spacy\npython -m spacy download en"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要开始使用spaCy,我们将使用示例文本的这句话:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"sample = u\"I can't imagine spending $3000 for a single bedroom apartment in N.Y.C.\""}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"现在我们导入spaCy和一个英文停用词列表。我们还将英语语言模型作为Language对象加载(根据spaCy约定,我们将其称为“nlp”),然后在示例文本上调用nlp对象,它会返回一个经过处理的Doc对象(我们将其称为“doc”)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"import spacy\nfrom spacy.lang.en.stop_words import STOP_WORDS\nnlp = spacy.load('en')\ndoc = nlp(sample)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这样就行了?根据spaCy"},{"type":"link","attrs":{"href":"https:\/\/spacy.io\/usage\/spacy-101","title":null,"type":null},"content":[{"type":"text","text":"文档"}]},{"type":"text","text":":"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"即使一个Doc经过了处理——例如拆分为单个单词并注释——它仍"},{"type":"text","marks":[{"type":"strong"}],"text":"包含原始文本的所有信息"},{"type":"text","text":",如空格字符。你随时可以将一个符号(token)的偏移量获取到原始字符串中,或​​者将符号及其尾随空格连接起来来重建原始字符串。这样,在使用spaCy处理文本时,你永远不会丢失任何信息。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"现在我们来看看处理过的样本:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"# Print out tokens\nprint(\"Tokens:\\n=======)\nfor token in doc:\n print(token)\n# Identify stop words\nprint(\"Stop words:\\n===========\")\nfor word in doc:\n if word.is_stop == True:\n print(word)\n# POS tagging\nprint(\"POS tagging:\\n============\")\nfor token in doc:\n print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,\n token.shape_, token.is_alpha, token.is_stop)\n# Print out named entities\nprint(\"Named entities:\\n===============\")\nfor ent in doc.ents:\n print(ent.text, ent.start_char, ent.end_char, ent.label_)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"Tokens:\n=======\nI\nca\nn't\nimagine\nspending\n$\n3000\nfor\na\nsingle\nbedroom\napartment\nin\nN.Y.C.\nStop words:\n===========\nca\nfor\na\nin\nPOS tagging:\n============\nI -PRON- PRON PRP nsubj X True False\nca can VERB MD aux xx True True\nn't not ADV RB neg x'x False False\nimagine imagine VERB VB ROOT xxxx True False\nspending spend VERB VBG xcomp xxxx True False\n$ $ SYM $ nmod $ False False\n3000 3000 NUM CD dobj dddd False False\nfor for ADP IN prep xxx True True\na a DET DT det x True True\nsingle single ADJ JJ amod xxxx True False\nbedroom bedroom NOUN NN compound xxxx True False\napartment apartment NOUN NN pobj xxxx True False\nin in ADP IN prep xx True True\nN.Y.C. n.y.c. PROPN NNP pobj X.X.X. False False\nNamed entities:\n===============\n3000 26 30 MONEY\nN.Y.C. 65 71 GPE"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spaCy功能强大、坚持己见(opinionated),可用于从预处理到表示再到建模的各种NLP任务。查看spaCy"},{"type":"link","attrs":{"href":"https:\/\/spacy.io\/usage\/spacy-101","title":null,"type":null},"content":[{"type":"text","text":"文档"}]},{"type":"text","text":",看看你可以用它做哪些事情。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4."},{"type":"link","attrs":{"href":"https:\/\/github.com\/huggingface\/transformers","title":null,"type":null},"content":[{"type":"text","text":"Hugging Face Transformers"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hugging Face的Transformers库已成为NLP实践不可或缺的一部分,这一点再怎么强调也不为过。根据GitHub存储库的介绍:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"用于PyTorch和TensorFlow 2.0的一流自然语言处理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Transformers提供了数千个预训练模型,可以对100多种语言的文本执行分类、信息提取、问答、摘要、翻译、文本生成等任务。它的目标是让所有人都更容易使用尖端的NLP技术。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Transformers提供了很多API,可以用来在给定文本上快速下载和使用这些预训练模型,在你自己的数据集上对它们进行微调,然后在我们的模型中心与社区共享。同时,定义架构的python模块都可以作为一个独立的模块进行修改,以实现快速的研究实验。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Transformers由两个最流行的深度学习库PyTorch和TensorFlow提供支持,它们之间无缝集成,允许你使用一个模型来训练你的模型,然后加载它来推理另一个。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你可以使用"},{"type":"link","attrs":{"href":"https:\/\/transformer.huggingface.co\/","title":null,"type":null},"content":[{"type":"text","text":"Write With Transformer"}]},{"type":"text","text":"在线测试Transformer库,这是该库的官方功能演示。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这个很复杂的库安装起来却很简单:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"pip install transformers"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Transformers库包罗万象,你可以花很多时间学习它的所有细节。然而,它自带的管道API让你可以立即使用模型,几乎不需要配置。以下是使用Transformers管道进行分类的一个示例(请注意,应先安装TensorFlow或PyTorch才能继续):"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"from transformers import pipeline\n# Allocate a pipeline for sentiment-analysis\nclassifier = pipeline('sentiment-analysis')\n# Classify text\nprint(classifier('I am a fan of KDnuggets, its useful content, and its helpful editors!'))"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"[{'label': 'POSITIVE', 'score': 0.9954679012298584}]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多简单,很有趣吧。这个管道使用了一个预训练的模型以及用于该模型的预处理,即使没有微调,结果也非常不错。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面是第二个管道示例,这次是问题回答:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"from transformers import pipeline\n# Allocate a pipeline for question-answering\nquestion_answerer = pipeline('question-answering')\n# Ask a question\nanswer = question_answerer({\n\t'question': 'Where is KDnuggets headquartered?',\n\t'context': 'KDnuggets was founded in February of 1997 by Gregory Piatetsky in Brookline, Massachusetts.'\n})\n# Print the answer\nprint(answer)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"{'score': 0.9153624176979065, 'start': 66, 'end': 90, 'answer': 'Brookline, Massachusetts'}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"当然上面是一对简单的示例,但这些管道非常强大,绝不止是解决一些与KDnuggets相关的琐碎任务那么简单!你可以在"},{"type":"link","attrs":{"href":"https:\/\/huggingface.co\/transformers\/main_classes\/pipelines.html","title":null,"type":null},"content":[{"type":"text","text":"此处"}]},{"type":"text","text":"阅读有关管道的更多信息。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Transformers让最先进的模型也能轻松供所有人使用。请访问它的GitHub"},{"type":"link","attrs":{"href":"https:\/\/github.com\/huggingface\/transformers","title":null,"type":null},"content":[{"type":"text","text":"存储库"}]},{"type":"text","text":",探索更多精彩。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"5."},{"type":"link","attrs":{"href":"https:\/\/github.com\/JasonKessler\/scattertext","title":null,"type":null},"content":[{"type":"text","text":"Scattertext"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scattertext用于创建吸引人的可视化图像,来描述语言在不同文档类型之间的差异。根据其GitHub仓库的介绍:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这是一种用于在语料库中查找区分术语,并将它们呈现在交互式HTML散点图中的工具。与术语对应的点被有选择地标记出来,防止它们与其他标签或点重叠。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"还没搞懂的话,我们先来安装它:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"pip install scattertext"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以下示例来自它的GitHub存储库,可视化了2012年美国大选中使用的术语。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2,000个与党派最相关的一元分词(unigram)显示为散点图中的点。它们的x轴和y轴分别是共和党和民主党发言人使用它们的密集等级。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"请注意,运行示例代码会生成一个HTML文件,然后可以在浏览器中查看该文件并与之交互。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"import scattertext as st\ndf = st.SampleCorpora.ConventionData2012.get_data().assign(\n parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)\n)\ncorpus = st.CorpusFromParsedDocuments(\n df, category_col='party', parsed_col='parse'\n).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))\nhtml = st.produce_scattertext_explorer(\n corpus,\n category='democrat', category_name='Democratic', not_category_name='Republican',\n minimum_term_frequency=0, pmi_threshold_coefficient=0,\n width_in_pixels=1000, metadata=corpus.get_df()['speaker'],\n transform=st.Scalers.dense_rank\n)\nopen('.\/demo_compact.html', 'w').write(html)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"查看时保存的HTML文件的结果(下面显示的是静态图像,因此不是交互式的):"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/0c\/82\/0c65161f3060d4a07ebc027731b5a882.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scattertext的用途很窄,但效果很好。它的可视化输出绝对很漂亮,而且富含见解。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文链接:"},{"type":"link","attrs":{"href":"https:\/\/www.kdnuggets.com\/2021\/02\/getting-started-5-essential-nlp-libraries.html","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/www.kdnuggets.com\/2021\/02\/getting-started-5-essential-nlp-libraries.html"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章