Random record from MongoDB

This article is translated from: Random record from MongoDB

I am looking to get a random record from a huge (100 million record) MongoDB collection.

What is the fastest and most efficient way to do so? The data is already there, and there is no field in which I can generate a random number and obtain a random row.

Any suggestions?


#1

Reference: https://stackoom.com/question/Bqgv/来自MongoDB的随机记录


#2

If you have a simple id key, you could store all the ids in an array, and then pick a random id. (Ruby answer):

# Fetch only the _id field of every document into an array
ids = @coll.find({}, fields: { _id: 1 }).to_a
# Sample one _id at random and fetch the matching document
@coll.find(ids.sample).first

#3

When I was faced with a similar problem, I backtracked and found that the business request was actually for creating some form of rotation of the inventory being presented. In that case, there are much better options, with answers coming from search engines like Solr rather than data stores like MongoDB.

In short, with the requirement to "intelligently rotate" content, what we should do instead of picking a random number across all of the documents is to include a per-user q-score modifier. To implement this yourself, assuming a small population of users, you can store a document per user that holds the productId, impression count, click-through count, last-seen date, and whatever other factors the business finds meaningful for computing a q-score modifier. When retrieving the set to display, you typically request more documents from the data store than the end user asked for, apply the q-score modifier, take the number of records the end user requested, and then randomize that page of results; since it is a tiny set, you can simply sort the documents in the application layer (in memory).
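
To make that flow concrete, here is a minimal JavaScript sketch. The rotatePage helper, the shape of the per-user stats map, and the scoring weights are all illustrative assumptions rather than anything prescribed in this answer.

function rotatePage(candidates, userStats, pageSize) {
    // Score each candidate: clicks raise the q-score, repeated impressions lower it (assumed weighting).
    var scored = candidates.map(function (doc) {
        var stats = userStats[doc.productId] || { impressions: 0, clicks: 0 };
        return { doc: doc, qScore: stats.clicks - 0.1 * stats.impressions };
    });

    // Keep the best pageSize documents by q-score...
    scored.sort(function (a, b) { return b.qScore - a.qScore; });
    var page = scored.slice(0, pageSize).map(function (s) { return s.doc; });

    // ...then shuffle that tiny set in memory (Fisher-Yates).
    for (var i = page.length - 1; i > 0; i--) {
        var j = Math.floor(Math.random() * (i + 1));
        var tmp = page[i]; page[i] = page[j]; page[j] = tmp;
    }
    return page;
}

// Usage: fetch more candidates than the user asked for, then rotate in the application layer, e.g.
// var candidates = db.products.find({ status: "A" }).limit(3 * pageSize).toArray();
// var page = rotatePage(candidates, statsForUser, pageSize);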

If the universe of users is too large, you can categorize users into behavior groups and index by behavior group rather than by user.

If the universe of products is small enough, you can create an index per user.

I have found this technique to be much more efficient, but more importantly, more effective in creating a relevant, worthwhile experience of using the software.


#4

If you're using mongoid, the document-to-object wrapper, you can do the following in Ruby. (Assuming your model is User.)

User.all.to_a[rand(User.count)]

In my .irbrc, I have

def rando klass
    klass.all.to_a[rand(klass.count)]
end

so in the Rails console, I can do, for example,

rando User
rando Article

to get documents randomly from any collection.


#5

None of the solutions worked well for me, especially when there are many gaps and the set is small. This worked very well for me (in PHP):

// Count the documents matching the filter
$count = $collection->count($search);
// Pick a random offset within that range
$skip = mt_rand(0, $count - 1);
// Skip to that offset and take a single document
$result = $collection->find($search)->skip($skip)->limit(1)->getNext();

#6

Using Map/Reduce, you can certainly get a random record, just not necessarily very efficiently, depending on the size of the resulting filtered collection you end up working with.

I've tested this method with 50,000 documents (the filter reduces it to about 30,000), and it executes in approximately 400ms on an Intel i3 with 16GB RAM and a SATA3 HDD...

db.toc_content.mapReduce(
    /* map function */
    function() { emit( 1, this._id ); },

    /* reduce function */
    function(k,v) {
        var r = Math.floor((Math.random()*v.length));
        return v[r];
    },

    /* options */
    {
        out: { inline: 1 },
        /* Filter the collection to "A"ctive documents */
        query: { status: "A" }
    }
);

The Map function simply creates an array of the _ids of all documents that match the query. In my case, I tested this with approximately 30,000 out of the 50,000 possible documents.

The Reduce function simply picks a random integer between 0 and the number of items (minus 1) in the array, and then returns that _id from the array.

400ms sounds like a long time, and it really is; if you had fifty million records instead of fifty thousand, this could increase the overhead to the point where it becomes unusable in multi-user situations.

There is an open issue for MongoDB to include this feature in the core... https://jira.mongodb.org/browse/SERVER-533

If this "random" selection was built into an index-lookup instead of collecting ids into an array and then selecting one, this would help incredibly. 如果将这种“随机”选择内置到索引查找中,而不是将id收集到一个数组中然后选择一个,那么这将非常有用。 (go vote it up!) (去投票吧!)
