Find duplicate lines in a file and count how many times each line was duplicated?

This post is translated from: Find duplicate lines in a file and count how many times each line was duplicated?

Suppose I have a file similar to the following:

123 
123 
234 
234 
123 
345

I would like to find how many times '123' was duplicated, how many times '234' was duplicated, etc. So ideally, the output would be like:

123  3 
234  2 
345  1

#1

Reference: https://stackoom.com/question/SAD7/在文件中查找重复的行并计算每行复制的时间长度


#2

To find and count duplicate lines in multiple files, you can try the following command:

sort <files> | uniq -c | sort -nr

or:

cat <files> | sort | uniq -c | sort -nr
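
Run against the single example file from the question, this pipeline should produce output roughly like the following (count first, most frequent line on top):

  3 123
  2 234
  1 345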

#3

This will print duplicate lines only, with counts:

sort FILE | uniq -cd

or, with GNU long options (on Linux):

sort FILE | uniq --count --repeated

On BSD and OSX you have to use grep to filter out unique lines:

sort FILE | uniq -c | grep -v '^ *1 '

For the given example, the result would be:

  3 123
  2 234

If you want to print counts for all lines, including those that appear only once:

sort FILE | uniq -c

or, with GNU long options (on Linux):

sort FILE | uniq --count

For the given input, the output is:

  3 123
  2 234
  1 345

To sort the output with the most frequent lines on top, you can do the following (to get all results):

sort FILE | uniq -c | sort -nr

or, to get only duplicate lines, most frequent first:

sort FILE | uniq -cd | sort -nr

On OSX and BSD the final one becomes:

sort FILE | uniq -c | grep -v '^ *1 ' | sort -nr

#4

Via awk:

awk '{dups[$1]++} END{for (num in dups) {print num,dups[num]}}' data

In the awk '{dups[$1]++}' part of the command, the variable $1 holds the entire contents of column 1, and the square brackets are array access. So, for the first column of each line in the data file, the node of the array named dups is incremented.

And at the end, we loop over the dups array with num as the variable, printing each saved number first and then its duplicate count from dups[num].
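
For the example input from the question this should yield the following (the order of a for (num in dups) traversal is unspecified, so the lines may appear in any order):

123 3
234 2
345 1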

Note that your input file has spaces at the end of some lines; if you clean those up, you can use $0 in place of $1 in the command above, as sketched below. :)
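
A minimal sketch of that whole-line variant (assuming the trailing spaces have been removed from data):

awk '{dups[$0]++} END{for (num in dups) {print num,dups[num]}}' data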


#5

On Windows, using Windows PowerShell, I used the command below to achieve this:

Get-Content .\file.txt | Group-Object | Select Name, Count

We can also use the Where-Object cmdlet to filter the result:

Get-Content .\file.txt | Group-Object | Where-Object { $_.Count -gt 1 } | Select Name, Count
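
If you also want the most frequent lines first (mirroring the sort -nr pipelines above), one possible variant, not from the original answer, uses the standard Sort-Object cmdlet:

Get-Content .\file.txt | Group-Object | Sort-Object Count -Descending | Select Name, Count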

#6

Assuming there is one number per line:

sort <file> | uniq -c

You can also use the more verbose --count flag with the GNU version, e.g., on Linux:

sort <file> | uniq --count