Notes for MongoDB
08 Apr 2016Notes for Data Wrangling with MongoDB from Udacity
We need to assess our data to:
- Test assumptions about: values, data types, and shape
- Identity errors or outliers
- Find missing values
CSV is lightweight:
- Each line of text is a single row
- Fileds are seperated by a delimeter
- Just the data itself
- Don’t need special software
- All spreadsheet apps read/write CSV
Data Modeling in JSON
- Items may have different fields
- May have nested objects
- May have nested arrays
XML desgin goals
- Data transfer
- Easy to write code to read/write
- Document valiation
- Human readable
- Supports a wide variety applications
Measures of data quality
- Validity: conforms to a schema
- Accuracy: conforms to gold standard
- Completeness: all records
- Consistency: matches other data
- Uniformity: same units
Blueprint for cleaning data
- Audit your data
- Create a data cleaning plan
- Execute the plan
- Manually correct
Inequality operators for MongoDB: $gt, $lt, $gte, $lte, $ne.
# Find cities with population > 2500, but <= 50000.
# Ineuqality operators also work for strings and dates.
query = {"population": {"$gt": 2500, "$lte", 50000}}
cities = db.cities.find(query)
# Update all documents in database
city = db.cities.update({"name": "Munchen", "country": "Germany"},
{"$set": {"isoCountryCode": "DEU"}},
multi=True)
Other MongoDB operators:
- $in: if in certain sets
- $exists: if certain value exists or not
- $all: must contain all the values
- $and: two conditions must all satisfy
In MongoDB shell:
# First part is query condition, second part is projection for output format.
db.tweets.find({"entities.hashtags": {"$ne": []}}, {"entities.hashtags.text": 1, "_id": 0})
# remove all cities that don't have names
db.cities.remove({"name": {"$exists": 0}})
Aggregation framework in MongoDB
Example: who tweeted most?
- Group Tweents by user
- Count each users tweets
- Sort into descending order
- Select user at top
# realize the above process through pipeline.
result = db.tweets.aggregate([
{"$group": {"_id": "$user.screen_name", "count": {"$sum": 1}}},
{"$sort": {"count": -1}}])
Use $project to:
- Include fields from the original document
- Insert computed fields
- Rename fields
- Craete fields that hold subdocuments
# Question: who has the highest followers_to_friends ratio?
# Match operator
result = db.tweets.aggregate([
{"$match": {"user.friends_count": {"$gt": 0},
"user.followers_count": {"$gt": 0}}},
{"$project": {"ratio": {"$divide": ["$user.followers_count",
"$user.friends_count"]},
"screen_name": "$user.screen_name"}},
{"$sort": {"ratio": -1}},
{"$limit": 1}])
Use $unwind :
result = db.tweets.aggregate([
{"$unwind": "$entities.user_mentions"},
{"$group": {"_id": "$user.screen_name",
"count": {"$sum": 1}}},
{"$sort": {"count": -1}},
{"$limit": 1}])
$group operators: $sum, $first, $last, $max, $min, $avg.
result = db.tweets.aggregate([
{"$unwind": "$entities.hashtags"},
{"$group": {"_id": "$entities.hastags.text",
"retweet_avg": {"$avg": "$retwet_count"}}},
{"$sort": {"retweet_avg": -1}},
{"$limit": 1}])
Array operators: $push, $addToSet
result = db.tweets.aggregate([
{"$unwind": "$entities.hashtags"},
{"$group": {"_id": "$user.screen_name",
"unique_hashtags": {"$addToSet": "$entities.hashtags.text"}}},
{"$sort": {"_id": -1}},
More aggregation operators can be found here.