Advanced String Operations: Counting Word Frequencies in Text with a Linked List

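Given a short English passage, the task is to count its word frequencies in C: read the text from a file, extract each word, count how often it occurs, and write the results to another file.
The passage is as follows: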

Of all the changes that have taken place in English-language newspapers during the past quarter-century, perhaps the most far-reaching has been the inexorable decline in the scope and seriousness of their arts coverage.
It is difficult to the point of impossibility for the average reader under the age of forty to imagine a time when high-quality arts criticism could be found in most big-city newspapers. Yet a considerable number of the most significant collections of criticism published in the 20th century consisted in large part of newspaper reviews. To read such books today is to marvel at the fact that their learned contents were once deemed suitable for publication in general-circulation dailies.
We are even farther removed from the unfocused newspaper reviews published in England between the turn of the 20th century and the eve of World War II, at a time when newsprint was dirt-cheap and stylish arts criticism was considered an ornament to the publications in which it appeared. In those far-off days, it was taken for granted that the critics of major papers would write in detail and at length about the events they covered. Theirs was a serious business, and even those reviewers who wore their learning lightly, like George Bernard Shaw and Ernest Newman, could be trusted to know what they were about. These men believed in journalism as a calling, and were proud to be published in the daily press. “So few authors have brains enough or literary gift enough to keep their own end up in journalism,” Newman wrote, “that I am tempted to define ‘journalism’ as ‘a term of contempt applied by writers who are not read to writers who are.’”
Unfortunately, these critics are virtually forgotten. Neville Cardus, who wrote for the Manchester Guardian from 1917 until shortly before his death in 1975, is now known solely as a writer of essays on the game of cricket. During his lifetime, though, he was also one of England’s foremost classical-music critics, a stylist so widely admired that his Autobiography (1947) became a best-seller. He was knighted in 1967, the first music critic to be so honored. Yet only one of his books is now in print, and his vast body of writings on music is unknown save to specialists.
Is there any chance that Cardus’s criticism will enjoy a revival? The prospect seems remote. Journalistic tastes had changed long before his death, and postmodern readers have little use for the richly upholstered Vicwardian prose in which he specialized. Moreover, the amateur tradition in music criticism has been in headlong retreat.

(It is actually a reading passage from an English exam for postgraduate admission.)

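The first question is which data structure should hold the words and their frequencies. An array is the obvious first thought, but it brings several problems: neither the number of distinct words nor the length of the longest word is known before the file is read, so a fixed-size character array either wastes space or overflows.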
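
To make the array problem concrete, here is a minimal sketch of what a fixed-size table could look like. `WordTable`, `tableAdd`, `MAX_WORDS` and `MAX_LEN` are names and limits invented for this illustration and are not part of the program in this post; the point is only that both limits have to be guessed before the file is read.

```c
#include <assert.h>
#include <string.h>

#define MAX_WORDS 512  /* assumed capacity: must be guessed before reading the text */
#define MAX_LEN   50   /* assumed longest word, including the terminating '\0' */

typedef struct {
    char words[MAX_WORDS][MAX_LEN]; /* every slot is reserved up front */
    int  counts[MAX_WORDS];
    int  used;                      /* number of distinct words stored so far */
} WordTable;

/* Count one occurrence of word. Returns 0 on success and -1 when the
   table is full or the word does not fit in a slot. */
int tableAdd(WordTable *t, const char *word) {
    assert(t != NULL && word != NULL);
    for (int i = 0; i < t->used; i++) {
        if (strcmp(t->words[i], word) == 0) {
            t->counts[i]++;          /* seen before: bump its count */
            return 0;
        }
    }
    if (t->used == MAX_WORDS || strlen(word) >= MAX_LEN) {
        return -1;                   /* the fixed array cannot grow */
    }
    strcpy(t->words[t->used], word);
    t->counts[t->used] = 1;
    t->used++;
    return 0;
}
```

When the table fills up or a word is too long, `tableAdd` can only report failure. Allocating one node per new word, as a linked list does, removes the need to guess either limit.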

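So a linked list is a natural choice. In C, a struct can play a role similar to a class in C++ and hold more complex data. Here each node has to store both a scanned word and its frequency, and no word may be stored twice: when a duplicate turns up, the count in the node for that word is incremented instead. That requirement is hard to meet with a plain array.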

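The functionality we need is now clear:

  • First, read the file and extract the words (the text contains plenty of punctuation, which has to be stripped).
  • Next, build the linked list: store each extracted word in a node and count it.
  • Finally, traverse every node of the list and output each word with its frequency.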
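
The cleanup in the first step has a few side effects that are easy to miss, so here is a small sketch of it in isolation. `stripNonLetters` is a hypothetical name used only for this illustration; it assumes that "word characters" means ASCII letters and nothing else.

```c
#include <assert.h>
#include <string.h>   /* also provides size_t */

/* Keep only ASCII letters in word, compacting it in place.
   Returns the new length (0 when nothing is left). */
size_t stripNonLetters(char *word) {
    assert(word != NULL);
    size_t kept = 0;
    for (size_t i = 0; word[i] != '\0'; i++) {
        char ch = word[i];
        if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) {
            word[kept++] = ch;   /* copy letters forward, skip the rest */
        }
    }
    word[kept] = '\0';
    return kept;
}
```

Under this rule a token such as `quarter-century,` becomes `quartercentury`, and a token with no letters at all, such as `(1947)`, becomes an empty string, so the caller probably wants to skip empty results rather than count them as words. Case is left untouched, which means `Yet` and `yet` are counted separately.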

The three steps map to three functions. The code is as follows:

#include<stdio.h>
#include<string.h>
#include<stdlib.h>
typedef struct Data{
    char *c; // the word
    int t;  // its frequency
}Data;

typedef struct Node{
    Data data;  // data field
    struct Node *next; // pointer to the next node
}Node, *pNode;

typedef struct HeadNode{  // head node
    int total_num; // total number of words read
    pNode next;
}HeadNode, *pHeadNode;
// function declarations
void deleteNotA(char str[]);
void insertNoes(pHeadNode head, char str[]);
void showItems(pHeadNode head, FILE *out);

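int main(){
    FILE *f_in, *f_out;
    char str[50] = "";
    f_in = fopen("f1.in", "r");
    f_out = fopen("f1.out", "w");
    if(f_in == NULL || f_out == NULL){
        // stop early if either file cannot be opened
        perror("fopen");
        return 1;
    }
    pHeadNode head = (pHeadNode) malloc(sizeof(HeadNode));
    if(head == NULL){
        return 1;
    }
    head->total_num = 0;
    head->next = NULL;
    // %49s leaves room for the terminating '\0' in str[50]
    while(fscanf(f_in, "%49s", str) == 1){
        deleteNotA(str);
        if(str[0] == '\0'){
            continue; // the token held no letters, e.g. "(1947)"
        }
        insertNoes(head, str);
        head->total_num++;
    }

    showItems(head, f_out);
    printf("Total words: %d\n", head->total_num);
    fclose(f_in);
    fclose(f_out);
    return 0;
}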

void deleteNotA(char str[]){
    // remove every character that is not an ASCII letter, in place
    int i, index = 0;
    for(i = 0; str[i] != '\0'; i++){
        if((str[i] >= 'A' && str[i] <= 'Z') || (str[i] >= 'a' && str[i] <= 'z')){
            str[index++] = str[i];
        }
    }
    str[index] = '\0';
}

void insertNoes(pHeadNode head, char str[]){
    size_t length = strlen(str);
    static pNode r = NULL; // tail pointer, so appending needs no extra traversal
    pNode p = head->next;
    while(p){ // look for an existing node that holds this word
        if(strcmp(p->data.c, str) == 0){
            p->data.t++;
            return ;
        }
        p = p->next;
    }
    // create a new node
    pNode node = (pNode)malloc(sizeof(Node));
    if(node == NULL){
        return ;
    }
    node->data.c = (char*)malloc(length + 1); // +1 for the terminating '\0'
    if(node->data.c == NULL){
        free(node);
        return ;
    }
    node->data.t = 1;
    node->next = NULL;
    strcpy(node->data.c, str);
    if(!r){
        head->next = node; // first node in the list
    }
    else{
        r->next = node;
    }
    r = node;
    return ;
}

void showItems(pHeadNode head, FILE *out){
    pNode p = head->next;
    while(p){ // write each word and its count to the file and to stdout
        fprintf(out, "%s:%d\n", p->data.c, p->data.t);
        printf("%s:%d\n", p->data.c, p->data.t);
        p = p->next;
    }
    fprintf(out, "Total words: %d\n", head->total_num);
}
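
The program above never releases the memory it allocates. That is harmless for a short run, since the operating system reclaims it at exit, but a cleanup pass could be sketched as follows. `freeList` is a hypothetical helper, not part of the original code, and the type definitions are repeated only so the block stands alone.

```c
#include <assert.h>
#include <stdlib.h>

/* Same shapes as the Data, Node and HeadNode types in the program above. */
typedef struct Data { char *c; int t; } Data;
typedef struct Node { Data data; struct Node *next; } Node, *pNode;
typedef struct HeadNode { int total_num; pNode next; } HeadNode, *pHeadNode;

/* Free every node, each node's word buffer, and the head itself.
   Returns the number of nodes released. */
int freeList(pHeadNode head) {
    assert(head != NULL);
    int freed = 0;
    pNode p = head->next;
    while (p) {
        pNode next = p->next;  /* save the link before the node goes away */
        free(p->data.c);
        free(p);
        p = next;
        freed++;
    }
    free(head);
    return freed;
}
```

Because `insertNoes` keeps its tail pointer in a static variable, a helper like this would only be safe to call once, at the end of `main` after `showItems`; inserting into a new list afterwards would follow a dangling pointer.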