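Problem

Implement a simple, primitive measure of file similarity: the similarity of two files is defined as the proportion of their common vocabulary to their total vocabulary. To simplify the problem, Chinese is not considered (word segmentation is too hard); only English words of at least 3 and at most 10 letters are considered, and for words longer than 10 letters only the first 10 letters count.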

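Input format:

The input first gives a positive integer N (≤100), the total number of files. The content of each file then follows in this format: the file body first, then a line containing only the character #, which marks the end of the file. After the N files comes the total number of queries M (≤10^4), followed by M lines, each giving a pair of file numbers separated by a space. Files are assumed to be numbered from 1 to N in the order given.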

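Output format:

For each query, output on one line the similarity of the two files, i.e. the number of words common to both files as a percentage of the total vocabulary of the two files, accurate to 1 decimal place. Note that a "word" here consists only of English letters and has a length of at least 3 and at most 10; for anything longer, only the first 10 letters are considered. Words are separated by any non-letter character. In addition, the same word in different cases counts as one word; for example, "You" and "you" are the same word.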
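The word rules above can be sketched as a small standalone function. This is only an illustration under the stated rules, not the solution used further down; the name extractWords and the choice of std::set as the return type are assumptions made for this example.

```cpp
#include <cctype>
#include <set>
#include <string>

// Hypothetical helper (not part of the original solution): returns the set of
// distinct normalized words in `text`, following the rules stated above.
std::set<std::string> extractWords(const std::string& text) {
    std::set<std::string> result;
    std::string word;  // first 10 letters of the current run, lowercased
    // Go one position past the end so that the last run is flushed too.
    for (std::size_t i = 0; i <= text.size(); ++i) {
        unsigned char c = (i < text.size())
                              ? static_cast<unsigned char>(text[i])
                              : ' ';
        if (std::isalpha(c)) {
            // Letters beyond the 10th are read but ignored.
            if (word.size() < 10)
                word += static_cast<char>(std::tolower(c));
        } else {
            // Any non-letter ends the run; keep it only if it has 3+ letters.
            if (word.size() >= 3)
                result.insert(word);
            word.clear();
        }
    }
    return result;
}
```

Calling extractWords on the third sample file ("Aaa2 ccc Eee" and "is at Ddd@Fff") should give aaa, ccc, ddd, eee and fff: the digit and the @ act as separators, and "is" and "at" are too short.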

Sample input:

3
Aaa Bbb Ccc
#
Bbb Ccc Ddd
#
Aaa2 ccc Eee
is at Ddd@Fff
#
2
1 2
1 3

Sample output:

50.0%
33.3%
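The sample can be checked by hand. Writing W_i for the vocabulary of file i (notation introduced here, not in the problem), the similarity is the size of the intersection over the size of the union, and the union can be obtained from the two vocabulary sizes and the common count. In file 3, "Aaa2" contributes "aaa", "Ddd@Fff" contributes two words, and "is" and "at" are too short to count.

```latex
\[
\operatorname{sim}(A,B)
  = \frac{|A \cap B|}{|A \cup B|} \times 100\%
  = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \times 100\%
\]
\[
W_1 = \{\text{aaa},\text{bbb},\text{ccc}\},\quad
W_2 = \{\text{bbb},\text{ccc},\text{ddd}\},\quad
W_3 = \{\text{aaa},\text{ccc},\text{eee},\text{ddd},\text{fff}\}
\]
\[
\operatorname{sim}(W_1,W_2) = \frac{2}{3+3-2} \times 100\% = 50.0\%,
\qquad
\operatorname{sim}(W_1,W_3) = \frac{2}{3+5-2} \times 100\% \approx 33.3\%
\]
```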

#include <iostream>
#include <cstdio>
#include <map>
#include <list>
#include <string>
#include <cctype>
#include <algorithm>
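// Inverted index: each word maps to the list of files that contain it.
using namespace std;
// link[i][i]: number of distinct words in file i;
// link[i][j] (i != j): number of words common to files i and j.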
int link[105][105];

//index:
map<string, list<int> > words;

int main(int argc, const char * argv[]) {
	int n, i, j, k;
	cin >> n;
	string line, aword;
	map<string, list<int> >::iterator word;
//input
	for (i = 1; i <= n; ++i) {
		int count = 0;
		getline(cin, line);
		while (line[0] != '#') {
			j = 0;
			while (j < line.length()) {
				k = 0;
				aword.clear();
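				// Collect at most the first 10 letters of the word.
				while (k < 10 && isalpha((unsigned char)line[j])) {
					aword += line[j];
					++j;
					++k;
				}
				// Skip any letters beyond the 10th.
				while (isalpha((unsigned char)line[j]))
					++j;
				// Process the word if it is long enough.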
				if (aword.length()>=3) {
					transform(aword.begin(), aword.end(), aword.begin(), ::tolower);
					word = words.find(aword);
					if (word==words.end()||word->second.back() != i) {
						// The word already appears in earlier files but not yet in file i:
						if (word != words.end()) {
							list<int>::iterator f;
							f = word->second.begin();
							// count one more common word with each earlier file containing it.
							while (f != word->second.end()) {
								link[i][*f]++;
								link[*f][i]++;
								++f;
							}
						}
						// Record file i for this word (creates the entry if the word is new).
						words[aword].push_back(i);
						++count;
					}
				}
				else
					++j;
			}
			getline(cin, line);
		}
		link[i][i]= count;
	}
//output:
	int m, f1, f2;
	cin >> m;
	double per;
	for (i = 0; i < m; ++i) {
		cin >> f1 >> f2;
		per = 1.0*link[f1][f2] / (link[f1][f1] + link[f2][f2] - link[f1][f2])*100;
		printf("%.1f%%\n", per);
	}
	return 0;
}
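For cross-checking the inverted-index solution on small inputs, a brute-force comparison of two word sets may be useful. The sketch below assumes each file's vocabulary is already available as a std::set of normalized words; the name similarityPercent and the 0% result for two empty files are assumptions of this example, not something the problem statement specifies.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: similarity in percent between two vocabularies,
// computed as |A ∩ B| / (|A| + |B| - |A ∩ B|) * 100.
double similarityPercent(const std::set<std::string>& a,
                         const std::set<std::string>& b) {
    std::vector<std::string> common;
    // std::set iterates in sorted order, as set_intersection requires.
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    std::size_t total = a.size() + b.size() - common.size();
    if (total == 0)
        return 0.0;  // assumption: two empty vocabularies count as 0% similar
    return 100.0 * common.size() / total;
}
```

With the sample vocabularies, files 1 and 2 give 50.0 and files 1 and 3 give roughly 33.3, matching the expected output. This costs time proportional to the two vocabulary sizes per query, whereas the solution above precomputes every pair while reading the input and answers each query with a table lookup.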


PTA Data Structures and Algorithms Problem Set (Chinese) 7-44 File Similarity Based on Word Frequency (30 points)
