MongoDB Map Reduce
MapReduce is a computing model that splits a large job into many small pieces that are processed independently (map) and then combines the partial results into a final result (reduce).
The MapReduce implementation provided by MongoDB is flexible and well suited to large-scale data analysis.
MapReduce command
The following is the basic syntax of MapReduce:
>db.collection.mapReduce(
   function() { emit(key, value); },                 // map function
   function(key, values) { return reduceFunction },  // reduce function
   {
      out: collection,
      query: document,
      sort: document,
      limit: number
   }
)
To use MapReduce, you implement two functions: the map function and the reduce function. The map function is run once for every document in the collection (or every document matched by the query) and calls emit(key, value). The emitted values are then grouped by key and passed to the reduce function for processing.
The map function returns its key-value pairs by calling emit(key, value); it may call emit any number of times for a single document.
Parameter description:
- map: the mapping function. It generates a sequence of key-value pairs that become the input of the reduce function.
- reduce: the aggregating function. Its task is to turn each key and its values array into that key and a single value.
- out: where the results are stored, either the name of an output collection or { inline: 1 } to return the results directly.
- query: a filter condition. Only documents that match it are passed to the map function. (query, limit, and sort can be combined freely.)
- sort: sorts the documents before they are sent to the map function. Combined with limit, it can optimize the grouping step.
- limit: the maximum number of documents sent to the map function. (Without limit, sort alone is of little use.)
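The flow described above (filter, sort, limit, map with emit, group by key, reduce) can be sketched in plain JavaScript without a MongoDB server. This is only an illustration under stated assumptions, not MongoDB's implementation: `simulateMapReduce` and the `orders` data are hypothetical, `query` and `sort` are ordinary callbacks rather than query documents, and `emit` is passed to the map function as an argument instead of being a global.

```javascript
// Hypothetical helper: a plain-JavaScript imitation of the mapReduce flow.
function simulateMapReduce(docs, map, reduce, options = {}) {
  const { query = () => true, sort = null, limit = Infinity } = options;

  // 1. query: keep only the documents that match the filter.
  let input = docs.filter(query);
  // 2. sort: order the documents before they reach the map function.
  if (sort) input = [...input].sort(sort);
  // 3. limit: cap the number of documents sent to the map function.
  input = input.slice(0, limit);

  // 4. map: each document is bound to `this`; emit() collects values by key.
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of input) map.call(doc, emit);

  // 5. reduce: collapse each values array into a single value.
  //    A key with only one value is passed through without calling reduce.
  const out = [];
  for (const [key, values] of groups) {
    out.push({ _id: key, value: values.length > 1 ? reduce(key, values) : values[0] });
  }
  return out;
}

// Hypothetical sample data: total paid amount per customer.
const orders = [
  { customer: "a", amount: 10, paid: true },
  { customer: "b", amount: 5, paid: true },
  { customer: "a", amount: 7, paid: true },
  { customer: "b", amount: 99, paid: false },
];

const totals = simulateMapReduce(
  orders,
  function (emit) { emit(this.customer, this.amount); },
  (key, values) => values.reduce((a, b) => a + b, 0),
  { query: (doc) => doc.paid }
);
console.log(totals); // customer "a" totals 17, customer "b" totals 5
```

Customer "b" has only one paid order, so its value reaches the output without the reduce function being called, which is why a reduce function should return a value of the same shape as the values it receives.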
Using MapReduce
Consider the following documents, which store users' articles. Each document holds the article text (post_text), the author's user_name, and the article's status:
>db.posts.insert({ "post_text": "Rookie tutorial, the most complete technical documentation.", "user_name": "mark", "status": "active" })
WriteResult({ "nInserted" : 1 })
>db.posts.insert({ "post_text": "Rookie tutorial, the most complete technical documentation.", "user_name": "mark", "status": "active" })
WriteResult({ "nInserted" : 1 })
>db.posts.insert({ "post_text": "Rookie tutorial, the most complete technical documentation.", "user_name": "mark", "status": "active" })
WriteResult({ "nInserted" : 1 })
>db.posts.insert({ "post_text": "Rookie tutorial, the most complete technical documentation.", "user_name": "mark", "status": "active" })
WriteResult({ "nInserted" : 1 })
>db.posts.insert({ "post_text": "Rookie tutorial, the most complete technical documentation.", "user_name": "mark", "status": "disabled" })
WriteResult({ "nInserted" : 1 })
>db.posts.insert({ "post_text": "Rookie tutorial, the most complete technical documentation.", "user_name": "runoob", "status": "disabled" })
WriteResult({ "nInserted" : 1 })
>db.posts.insert({ "post_text": "Rookie tutorial, the most complete technical documentation.", "user_name": "runoob", "status": "disabled" })
WriteResult({ "nInserted" : 1 })
>db.posts.insert({ "post_text": "Rookie tutorial, the most complete technical documentation.", "user_name": "runoob", "status": "active" })
WriteResult({ "nInserted" : 1 })
Now we run mapReduce on the posts collection to select the published articles (status: "active") and group them by user_name to count the articles per user:
>db.posts.mapReduce(
   function() { emit(this.user_name, 1); },
   function(key, values) { return Array.sum(values) },
   {
      query: { status: "active" },
      out: "post_total"
   }
)
The mapReduce call above outputs:
{
   "result" : "post_total",
   "timeMillis" : 23,
   "counts" : {
      "input" : 5,
      "emit" : 5,
      "reduce" : 1,
      "output" : 2
   },
   "ok" : 1
}
The results show that five documents match the query (status: "active"), so the map function emits five key-value pairs. The reduce function then merges the pairs that share a key, leaving two groups.
Description of the output fields:
- result: the name of the collection where the results are stored. Because out names a collection (post_total), the collection is kept after the connection is closed.
- timeMillis: the time taken to execute, in milliseconds.
- input: the number of documents that meet the conditions and are sent to the map function.
- emit: the number of times emit was called in the map function, that is, the total number of key-value pairs generated.
- reduce: the number of times the reduce function was called.
- output: the number of documents in the result collection (the counts are very helpful for debugging).
- ok: whether the operation succeeded; 1 means success.
- err: present on failure and gives the reason, although in practice the reason is often vague and of little help.
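To see where the counts come from, the same job can be traced by hand in plain JavaScript. The sketch below is not how the server computes them; it assumes that all values for a key are reduced in a single call and that reduce is skipped for keys with only one value. The variable names are illustrative.

```javascript
// The eight sample posts, reduced to the two fields the job reads.
const posts = [
  { user_name: "mark", status: "active" },
  { user_name: "mark", status: "active" },
  { user_name: "mark", status: "active" },
  { user_name: "mark", status: "active" },
  { user_name: "mark", status: "disabled" },
  { user_name: "runoob", status: "disabled" },
  { user_name: "runoob", status: "disabled" },
  { user_name: "runoob", status: "active" },
];

const counts = { input: 0, emit: 0, reduce: 0, output: 0 };
const groups = new Map();

for (const post of posts) {
  if (post.status !== "active") continue; // query: { status: "active" }
  counts.input += 1;
  // map: emit(this.user_name, 1)
  counts.emit += 1;
  if (!groups.has(post.user_name)) groups.set(post.user_name, []);
  groups.get(post.user_name).push(1);
}

const postTotal = [];
for (const [key, values] of groups) {
  let value = values[0];
  if (values.length > 1) {
    // reduce: Array.sum(values), only needed when a key has several values
    value = values.reduce((a, b) => a + b, 0);
    counts.reduce += 1;
  }
  postTotal.push({ _id: key, value });
}
counts.output = postTotal.length;

console.log(counts);    // input 5, emit 5, reduce 1, output 2
console.log(postTotal); // mark has 4, runoob has 1
```

Only "mark" has more than one emitted value, which is why reduce is counted once even though two result documents are produced.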
Use the find() method to view the mapReduce results:
>db.posts.mapReduce(
   function() { emit(this.user_name, 1); },
   function(key, values) { return Array.sum(values) },
   {
      query: { status: "active" },
      out: "post_total"
   }
).find()
The query returns the following results. The user mark has four published articles and the user runoob has one:
{ "_id" : "mark", "value" : 4 }
{ "_id" : "runoob", "value" : 1 }
In a similar way, MapReduce can be used to build large and complex aggregate queries.
The map and reduce functions are written in JavaScript, which makes MapReduce very flexible and powerful.
MongoDB aggregation
Aggregation in MongoDB is mainly used to process data (for example, computing averages or sums) and return the computed results. It is similar to aggregate functions such as count(*) in SQL.
aggregate() method
Aggregation in MongoDB is performed with the aggregate() method.
Syntax
The basic syntax format of the aggregate() method is as follows:
>db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)
Example
The data in the collection is as follows:
{
   _id: ObjectId(7df78ad8902c),
   title: 'MongoDB Overview',
   description: 'MongoDB is no sql database',
   by_user: 'w3cschool.cc',
   url: 'http://www.w3cschool.cc',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100
},
{
   _id: ObjectId(7df78ad8902d),
   title: 'NoSQL Overview',
   description: 'No sql database is very fast',
   by_user: 'w3cschool.cc',
   url: 'http://www.w3cschool.cc',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 10
},
{
   _id: ObjectId(7df78ad8902e),
   title: 'Neo4j Overview',
   description: 'Neo4j is no sql database',
   by_user: 'Neo4j',
   url: 'http://www.neo4j.com',
   tags: ['neo4j', 'database', 'NoSQL'],
   likes: 750
}
Now let's count the articles written by each author in the collection above, using aggregate():
> db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : 1}}}])
{
   "result" : [
      {
         "_id" : "w3cschool.cc",
         "num_tutorial" : 2
      },
      {
         "_id" : "Neo4j",
         "num_tutorial" : 1
      }
   ],
   "ok" : 1
}
>
The above example is similar to the SQL statement: select by_user, count(*) from mycol group by by_user
In the above example, we group the data by the by_user field and count the documents that share the same by_user value.
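What this $group stage computes can be imitated in plain JavaScript. The sketch below is an illustration only: `countBy` is a hypothetical helper, and `mycol` is an in-memory copy of the sample documents trimmed to the fields that matter here.

```javascript
// In-memory copy of the sample documents (only the relevant fields).
const mycol = [
  { title: "MongoDB Overview", by_user: "w3cschool.cc", likes: 100 },
  { title: "NoSQL Overview", by_user: "w3cschool.cc", likes: 10 },
  { title: "Neo4j Overview", by_user: "Neo4j", likes: 750 },
];

// Hypothetical helper imitating
// { $group: { _id: "$by_user", num_tutorial: { $sum: 1 } } }.
function countBy(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    // { $sum: 1 } adds 1 for every document that falls into the group.
    counts.set(doc[field], (counts.get(doc[field]) || 0) + 1);
  }
  return [...counts].map(([key, n]) => ({ _id: key, num_tutorial: n }));
}

const perAuthor = countBy(mycol, "by_user");
console.log(perAuthor); // w3cschool.cc has 2, Neo4j has 1
```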
The following table shows some aggregate expressions:
Expression | Description | Example |
---|---|---|
$sum | Calculates the sum. | db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : "$likes"}}}]) |
$avg | Calculates the average. | db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$avg : "$likes"}}}]) |
$min | Gets the minimum of the corresponding values across the documents in each group. | db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$min : "$likes"}}}]) |
$max | Gets the maximum of the corresponding values across the documents in each group. | db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$max : "$likes"}}}]) |
$push | Appends the value to an array in the result document. | db.mycol.aggregate([{$group : {_id : "$by_user", url : {$push : "$url"}}}]) |
$addToSet | Adds the value to an array in the result document without creating duplicates. | db.mycol.aggregate([{$group : {_id : "$by_user", url : {$addToSet : "$url"}}}]) |
$first | Gets the value from the first document in each group, according to the order of the source documents. | db.mycol.aggregate([{$group : {_id : "$by_user", first_url : {$first : "$url"}}}]) |
$last | Gets the value from the last document in each group, according to the order of the source documents. | db.mycol.aggregate([{$group : {_id : "$by_user", last_url : {$last : "$url"}}}]) |
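The difference between these expressions is easiest to see when each one is applied to the same groups. The following plain-JavaScript sketch imitates them on an in-memory copy of the sample documents; `accumulators` and `groupBy` are hypothetical names, and the order of $addToSet results here follows insertion order, which MongoDB itself does not guarantee.

```javascript
const mycol = [
  { by_user: "w3cschool.cc", url: "http://www.w3cschool.cc", likes: 100 },
  { by_user: "w3cschool.cc", url: "http://www.w3cschool.cc", likes: 10 },
  { by_user: "Neo4j", url: "http://www.neo4j.com", likes: 750 },
];

// Hypothetical table: each entry folds the values collected for one group
// into the value the corresponding expression would produce.
const accumulators = {
  $sum: (vs) => vs.reduce((a, b) => a + b, 0),
  $avg: (vs) => vs.reduce((a, b) => a + b, 0) / vs.length,
  $min: (vs) => Math.min(...vs),
  $max: (vs) => Math.max(...vs),
  $push: (vs) => [...vs],
  $addToSet: (vs) => [...new Set(vs)],
  $first: (vs) => vs[0],
  $last: (vs) => vs[vs.length - 1],
};

// Group docs by idField, then apply one accumulator to valueField.
function groupBy(docs, idField, op, valueField) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc[idField])) groups.set(doc[idField], []);
    groups.get(doc[idField]).push(doc[valueField]);
  }
  return [...groups].map(([key, vs]) => ({ _id: key, value: accumulators[op](vs) }));
}

console.log(groupBy(mycol, "by_user", "$sum", "likes"));    // 110 and 750
console.log(groupBy(mycol, "by_user", "$avg", "likes"));    // 55 and 750
console.log(groupBy(mycol, "by_user", "$push", "url"));     // the w3cschool.cc URL appears twice
console.log(groupBy(mycol, "by_user", "$addToSet", "url")); // the w3cschool.cc URL appears once
```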
Concept of pipeline
In Unix and Linux, a pipe passes the output of one command to the next command as its input.
The MongoDB aggregation pipeline works the same way: after one stage has processed the documents, the result is passed to the next stage. The same kind of stage can appear more than once in a pipeline.
Expressions process input documents and produce output. Expressions are stateless: they can only compute on the current document in the pipeline and cannot refer to other documents.
Here we introduce some common operations in the aggregation framework:
- $project: modify the structure of the input document. It can be used to rename, add, or delete fields, or to create calculation results and nested documents.
- $match: used to filter data and output only the documents that meet the conditions. $match uses MongoDB's standard query operations.
- $limit: used to limit the number of documents returned by the MongoDB aggregation pipeline.
- $skip: skip the specified number of documents in the aggregation pipeline and return the remaining documents.
- $unwind: split an array field in the document into multiple documents, each containing one value from the array.
- $group: groups documents in the collection, which can be used to count results.
- $sort: sort the input documents and output them.
- $geoNear: output ordered documents close to a geographic location.
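The idea that each stage consumes the output of the previous one can be sketched in plain JavaScript by treating every stage as a function from an array of documents to an array of documents. The stage builders and `runPipeline` below are hypothetical names that only imitate the behaviour of the real stages; they take callbacks rather than MongoDB query documents.

```javascript
// Hypothetical stage builders: each returns a function from docs to docs.
const $match = (pred) => (docs) => docs.filter(pred);
const $sort = (cmp) => (docs) => [...docs].sort(cmp);
const $skip = (n) => (docs) => docs.slice(n);
const $limit = (n) => (docs) => docs.slice(0, n);
const $unwind = (field) => (docs) =>
  docs.flatMap((doc) => doc[field].map((item) => ({ ...doc, [field]: item })));

// Run the stages in order, feeding each stage's output into the next.
const runPipeline = (docs, stages) => stages.reduce((acc, stage) => stage(acc), docs);

const articles = [
  { title: "MongoDB Overview", tags: ["mongodb", "database", "NoSQL"], likes: 100 },
  { title: "NoSQL Overview", tags: ["mongodb", "database", "NoSQL"], likes: 10 },
  { title: "Neo4j Overview", tags: ["neo4j", "database", "NoSQL"], likes: 750 },
];

const result = runPipeline(articles, [
  $match((doc) => doc.likes >= 100),  // drops "NoSQL Overview"
  $sort((a, b) => b.likes - a.likes), // "Neo4j Overview" comes first
  $unwind("tags"),                    // one document per tag: 6 documents
  $skip(1),                           // drops the first unwound document
  $limit(2),                          // keeps the next two
]);
console.log(result); // "Neo4j Overview" with tags "database" and "NoSQL"
```

Because every stage has the same shape, stages can be reordered or repeated freely, which is the property the pipeline model relies on.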
Pipeline operator examples
1. $project example
db.article.aggregate(
    { $project : {
        title : 1 ,
        author : 1
    }}
);
In this case, the result contains only three fields: _id, title, and author. The _id field is included by default. To exclude it, do the following:
db.article.aggregate(
    { $project : {
        _id : 0 ,
        title : 1 ,
        author : 1
    }}
);
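The default handling of _id can be imitated in a few lines of plain JavaScript. This is a sketch only: `project` is a hypothetical helper that supports just the inclusion form shown above (field : 1, optionally with _id : 0), and the sample document is made up because the source does not show the article collection.

```javascript
// Hypothetical sample document for illustration.
const article = [
  { _id: 1, title: "MongoDB Overview", author: "w3cschool.cc", likes: 100 },
];

// Hypothetical helper imitating an inclusion-style $project.
function project(docs, spec) {
  const keepId = spec._id !== 0; // _id is kept unless explicitly set to 0
  return docs.map((doc) => {
    const out = {};
    if (keepId && "_id" in doc) out._id = doc._id;
    for (const [field, flag] of Object.entries(spec)) {
      if (field !== "_id" && flag === 1 && field in doc) out[field] = doc[field];
    }
    return out;
  });
}

const withId = project(article, { title: 1, author: 1 });
const withoutId = project(article, { _id: 0, title: 1, author: 1 });
console.log(Object.keys(withId[0]));    // _id, title, author
console.log(Object.keys(withoutId[0])); // title, author
```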
2. $match example
db.articles.aggregate( [
    { $match : { score : { $gt : 70, $lte : 90 } } },
    { $group: { _id: null, count: { $sum: 1 } } }
] );
$match selects the records whose score is greater than 70 and less than or equal to 90, and then passes the matching records to the next stage, $group, for processing.
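The boundary behaviour of $gt and $lte is worth checking on concrete values. The plain-JavaScript sketch below imitates the two stages on made-up scores (the source does not show the articles collection), so the data and variable names are assumptions.

```javascript
// Hypothetical sample scores for illustration.
const articles = [
  { title: "a", score: 65 },
  { title: "b", score: 70 }, // excluded: $gt is strictly greater than 70
  { title: "c", score: 71 },
  { title: "d", score: 90 }, // included: $lte allows exactly 90
  { title: "e", score: 95 },
];

// Stage 1, $match: { score: { $gt: 70, $lte: 90 } }
const matched = articles.filter((doc) => doc.score > 70 && doc.score <= 90);

// Stage 2, $group: { _id: null, count: { $sum: 1 } }
// _id: null puts every matched document into one group;
// with no input documents, $group produces no output document.
const grouped = matched.length > 0 ? [{ _id: null, count: matched.length }] : [];

console.log(grouped); // one document with count 2
```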
3. $skip example
db.article.aggregate( { $skip : 5 });
After being processed by the $skip pipeline operator, the first five documents are "filtered out".