Random record from MongoDB

Translated from: Random record from MongoDB

I am looking to get a random record from a huge MongoDB collection (100 million records).

What is the fastest and most efficient way to do so? The data is already there, and there is no field in which I can generate a random number and obtain a random row.

Any suggestions?


#1

Reference: https://stackoom.com/question/Bqgv/來自MongoDB的隨機記錄


#2

If you have a simple id key, you could store all the ids in an array and then pick a random id. (Ruby answer):

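# Fetch only the _id of every document (`fields:` is the legacy 1.x driver option; 2.x uses `projection:`)
ids = @coll.find({}, fields: { _id: 1 }).to_a
# Each element is a { _id: ... } document, so it can be passed back as the query
@coll.find(ids.sample).first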
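
To see the two steps in isolation, here is a minimal sketch that runs without a database. The DOCS array stands in for the collection, and the helper names all_ids and random_document are made up for illustration; they are not part of any driver.

```ruby
# In-memory stand-in for a collection; these documents are made up.
DOCS = [
  { _id: 101, name: "alpha" },
  { _id: 205, name: "beta" },
  { _id: 317, name: "gamma" }
].freeze

# Step 1: collect only the ids (what the projection on _id does).
def all_ids(docs)
  docs.map { |doc| doc[:_id] }
end

# Step 2: pick one id uniformly at random, then look the document up by it.
def random_document(docs, rng: Random.new)
  id = all_ids(docs).sample(random: rng)
  docs.find { |doc| doc[:_id] == id }
end

puts random_document(DOCS).inspect
```

Note that the id array grows with the collection, so on 100 million records this approach costs a lot of memory and a full scan.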

#3

When I was faced with a similar problem, I backtracked and found that the business request was actually for some form of rotation of the inventory being presented. In that case, there are much better options, and they come from search engines like Solr, not data stores like MongoDB.

In short, when the requirement is to "intelligently rotate" content, what we should do instead of picking a random number across all of the documents is include a personal q score modifier. To implement this yourself, assuming a small population of users, you can store a document per user that has the productId, impression count, click-through count, last seen date, and whatever other factors the business finds meaningful for computing a q score modifier. When retrieving the set to display, you typically request more documents from the data store than the end user asked for, apply the q score modifier, take the number of records the end user requested, and then randomize that page of results. The page is a tiny set, so you can simply sort the documents in the application layer (in memory).
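
The retrieval steps above can be sketched roughly as follows. Everything here is an assumption for illustration: the Stat fields, the base_score, and especially the q_modifier formula are invented, since the answer does not say how the modifier is computed.

```ruby
# Hypothetical per-user stats for each product (all numbers are made up).
Stat = Struct.new(:product_id, :base_score, :impressions, :clicks)

# Hypothetical modifier: reward click-through, damp items already shown often.
def q_modifier(stat)
  ctr = stat.impressions.zero? ? 0.0 : stat.clicks.to_f / stat.impressions
  (1.0 + ctr) / (1.0 + stat.impressions / 10.0)
end

# candidates is the over-fetched list (more items than the page size).
def rotated_page(candidates, page_size, rng: Random.new)
  ranked = candidates.sort_by { |s| -s.base_score * q_modifier(s) }
  # Keep the top page_size items, then randomize only that tiny page in memory.
  ranked.first(page_size).shuffle(random: rng)
end

candidates = [
  Stat.new("p1", 0.9, 50, 1),
  Stat.new("p2", 0.8, 0, 0),
  Stat.new("p3", 0.7, 10, 3),
  Stat.new("p4", 0.6, 2, 1)
]
puts rotated_page(candidates, 2).map(&:product_id).inspect
```

The shuffle only touches the final page, so its cost does not depend on the size of the collection.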

If the universe of users is too large, you can categorize users into behavior groups and index by behavior group rather than by user.

If the universe of products is small enough, you can create an index per user.

I have found this technique to be much more efficient and, more importantly, more effective at creating a relevant, worthwhile experience of using the software solution.


#4

If you're using Mongoid, the document-to-object wrapper, you can do the following in Ruby (assuming your model is User):

User.all.to_a[rand(User.count)]

In my .irbrc, I have:

def rando(klass)
  klass.all.to_a[rand(klass.count)]
end

so in the Rails console I can do, for example:

rando User
rando Article

to get documents randomly from any collection.


#5

None of the solutions worked well for me, especially when there are many gaps and the set is small. This worked very well for me (in PHP):

$count = $collection->count($search);
$skip = mt_rand(0, $count - 1);
$result = $collection->find($search)->skip($skip)->limit(1)->getNext();
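
The same count-then-skip idea, sketched in Ruby against an in-memory array so it runs without a server. ITEMS and random_match are hypothetical names; the guard for an empty result set is an addition, since a count of zero would otherwise produce an invalid random range.

```ruby
# Made-up documents; the sparse ids mimic "gaps" left by deletions.
ITEMS = [
  { _id: 3,   status: "A" },
  { _id: 40,  status: "B" },
  { _id: 41,  status: "A" },
  { _id: 900, status: "A" }
].freeze

# Count the matches, skip a random number of them, and take the next one.
def random_match(docs, rng: Random.new, &filter)
  matches = docs.select(&filter)
  return nil if matches.empty?   # nothing to pick from
  skip = rng.rand(matches.size)  # integer in 0...count
  matches.drop(skip).first       # equivalent of skip(n).limit(1)
end

puts random_match(ITEMS) { |d| d[:status] == "A" }.inspect
```

Because the pick depends on position rather than on id values, gaps in the ids do not bias the result.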

#6

Using Map/Reduce, you can certainly get a random record, though not necessarily very efficiently, depending on the size of the filtered collection you end up working with.

I've tested this method with 50,000 documents (the filter reduces it to about 30,000), and it executes in approximately 400ms on an Intel i3 with 16GB RAM and a SATA3 HDD.

db.toc_content.mapReduce(
    /* map function */
    function() { emit( 1, this._id ); },

    /* reduce function */
    function(k,v) {
        var r = Math.floor((Math.random()*v.length));
        return v[r];
    },

    /* options */
    {
        out: { inline: 1 },
        /* Filter the collection to "A"ctive documents */
        query: { status: "A" }
    }
);

The Map function simply creates an array of the ids of all documents that match the query. In my case I tested this with approximately 30,000 out of the 50,000 possible documents.

The Reduce function simply picks a random integer between 0 and the number of items in the array minus one, and then returns that _id from the array.
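
As a rough illustration of what the two phases compute, here is a database-free Ruby sketch. RECORDS, map_phase, and reduce_phase are hypothetical names, and the sketch assumes reduce is called exactly once per key.

```ruby
# Made-up documents; "A" marks active ones, as in the query above.
RECORDS = [
  { _id: "a1", status: "A" },
  { _id: "b2", status: "I" },
  { _id: "c3", status: "A" }
].freeze

# Map phase: every matching document emits its _id under the same key (1).
def map_phase(docs)
  docs.select { |d| d[:status] == "A" }
      .group_by { 1 }
      .transform_values { |group| group.map { |d| d[:_id] } }
end

# Reduce phase: pick a random index in 0..values.length-1 and return that id.
def reduce_phase(_key, values, rng: Random.new)
  values[(rng.rand * values.length).floor]
end

emitted = map_phase(RECORDS) # one key (1) mapped to ["a1", "c3"]
emitted.each { |key, ids| puts reduce_phase(key, ids) }
```

Every id ends up under one key, which is why the whole filtered id list has to be materialized before a single value can be chosen.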

400ms sounds like a long time, and it really is. If you had fifty million records instead of fifty thousand, this may increase the overhead to the point where it becomes unusable in multi-user situations.

There is an open issue for MongoDB to include this feature in the core: https://jira.mongodb.org/browse/SERVER-533

If this "random" selection were built into an index lookup instead of collecting ids into an array and then selecting one, it would help enormously. (Go vote it up!)
