Analysing reddit data - part 3: cleaning and describing the data
Over the past two weeks (here and here) we have been discussing how to use JSON-encoded data from reddit. So far we have set up our environment and extracted the top 1,000 posts of all time from the subreddit /r/relationships into a pandas DataFrame. This week, we will work on cleaning the data, extracting further information from our existing variables and describing those variables. We’ll end the series next week with some basic inferential analyses.
Picking up where we left off
Last week, we ended up with a pandas DataFrame called rel_df with five variables: Date, Title, Flair, Comments and Score. If you don’t have this DataFrame prepared, you’ll need to go back to the previous posts and set it up before continuing with the tutorial. Here are the first 5 rows of the DataFrame.
rel_df[:5]
| | Date | Title | Flair | Comments | Score |
|---|---|---|---|---|---|
0 | 1440187622 | [UPDATE]My [26 F] with my husband [29 M] 1 yea… | Updates | 908 | 7843 |
1 | 1438962646 | Update: I [30 F] am sitting in the back of my … | ◉ Locked Post ◉ | 631 | 6038 |
2 | 1435026034 | UPDATE: My fiancee (24F) has no bridesmaids an… | Updates | 623 | 5548 |
3 | 1438393090 | My [42M] daughter [17F] has been bullying a gi… | ◉ Locked Post ◉ | 970 | 5301 |
4 | 1440543117 | [Update] My [26F] fiance’s [28M] ex-wife [28F]… | Updates | 767 | 5195 |
Cleaning the data
As I pointed out in the last blog post, there are two immediately obvious issues with these data. The first is that the date is in Unix or Epoch time, which represents the number of seconds that have passed since 1 January 1970. In order to convert this into a datetime format, we run the following:
rel_df['Date'] = pd.to_datetime((rel_df['Date'].values*1e9).astype(int))
rel_df[:5]
| | Date | Title | Flair | Comments | Score |
|---|---|---|---|---|---|
0 | 2015-08-21 20:07:02 | [UPDATE]My [26 F] with my husband [29 M] 1 yea… | Updates | 908 | 7843 |
1 | 2015-08-07 15:50:46 | Update: I [30 F] am sitting in the back of my … | ◉ Locked Post ◉ | 631 | 6038 |
2 | 2015-06-23 02:20:34 | UPDATE: My fiancee (24F) has no bridesmaids an… | Updates | 623 | 5548 |
3 | 2015-08-01 01:38:10 | My [42M] daughter [17F] has been bullying a gi… | ◉ Locked Post ◉ | 970 | 5301 |
4 | 2015-08-25 22:51:57 | [Update] My [26F] fiance’s [28M] ex-wife [28F]… | Updates | 767 | 5195 |
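As an aside, pandas can also do this conversion more directly if you tell it the raw values are in seconds. A minimal alternative sketch (it should be applied to the original integer timestamps, i.e. before the conversion above):

# Equivalent conversion: interpret the raw values as seconds since the epoch
rel_df['Date'] = pd.to_datetime(rel_df['Date'], unit='s')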
The other issue is that when posts are locked by the subreddit moderators, the original flair is replaced with “Locked Post”. This is not a very useful label, as it doesn’t give us any information about the topic of the post. Let’s replace all of the “Locked Post” flairs with missing values (NaN) and have a look at how many there are.
import re
import numpy as np

# Take the 'Locked Post' flair from the second row and replace it with NaN throughout
replace_value = rel_df['Flair'][1]
rel_df['Flair'] = rel_df['Flair'].replace(replace_value, np.nan)
rel_df['Flair'].isnull().sum()
155
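Grabbing the replacement value by its row position works here because we happen to know the second post is locked, but it is a little brittle. A more position-independent sketch, assuming every locked flair contains the text “Locked Post”:

# Replace any flair containing 'Locked Post' with NaN, regardless of which rows are locked
locked = rel_df['Flair'].str.contains('Locked Post', na=False)
rel_df.loc[locked, 'Flair'] = np.nan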
You can see a substantial number (around 16%) of flairs were replaced with “Locked Post”, which means we have a large amount of missing data in this variable. However, we can recover some information by exploiting the fact that update posts usually have the word “Update” in the title. We can use a regex to check whether “Update” is in the title, and if so, replace the flair with “Updates”.
cond1 = rel_df['Title'].str.contains(
    r'^\[?[a-z!?A-Z ]*UPDATE\]?:?', flags=re.IGNORECASE)
cond2 = rel_df['Flair'].isnull()
rel_df.loc[cond1 & cond2, 'Flair'] = 'Updates'
rel_df[:5]
| | Date | Title | Flair | Comments | Score |
|---|---|---|---|---|---|
0 | 2015-08-21 20:07:02 | [UPDATE]My [26 F] with my husband [29 M] 1 yea… | Updates | 908 | 7843 |
1 | 2015-08-07 15:50:46 | Update: I [30 F] am sitting in the back of my … | Updates | 631 | 6038 |
2 | 2015-06-23 02:20:34 | UPDATE: My fiancee (24F) has no bridesmaids an… | Updates | 623 | 5548 |
3 | 2015-08-01 01:38:10 | My [42M] daughter [17F] has been bullying a gi… | NaN | 970 | 5301 |
4 | 2015-08-25 22:51:57 | [Update] My [26F] fiance’s [28M] ex-wife [28F]… | Updates | 767 | 5195 |
rel_df['Flair'].isnull().sum()
96
You can see that we still have about 10% missing data in this variable, which is quite high. However, we have recovered a good chunk of the missing flairs in a fairly robust way. The remaining missingness should be bookmarked as a possible source of bias when we try to interpret our analyses.
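Incidentally, if you want to convince yourself about what the update pattern above does and does not match, a quick sanity check on a couple of made-up titles (the example strings here are purely illustrative):

# The pattern should only match titles that start with some variant of 'UPDATE'
update_pattern = re.compile(r'^\[?[a-z!?A-Z ]*UPDATE\]?:?', flags=re.IGNORECASE)
print(bool(update_pattern.search('[UPDATE] My [26 F] example title')))  # True
print(bool(update_pattern.search('My [42M] example title')))            # False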
Extracting extra variables
You might have gotten the idea from how I replaced the missing flairs that we could extract further information from the Title variable, and indeed we can. Another convention we can exploit is that posters to /r/relationships are asked to include their age and sex in the title. For example, you can see in the first post that the poster has included their age and sex as “[26 F]”, indicating they are a 26-year-old woman. You can probably also see a pattern in how the poster’s information is nested in the title. Looking through the data, I picked out four words that commonly precede the poster’s information: “My”, “I”, “I’m” and “Me”. We can use a (pretty complicated) regex to extract this portion of the title:
poster_age_sex = rel_df['Title'].str.extract(
    r"((i\'m|i|my|me)\s?(\[|\()(m|f)?(\s|/)?[0-9]{1,2}(\s|/)?([m,f]|male|female)?(\]|\)))",
    flags=re.IGNORECASE)[0]
poster_age_sex[:5]
0 My [26 F]
1 I [30 F]
2 I (25m)
3 My [42M]
4 My [26F]
Name: 0, dtype: object
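Since that pattern is quite dense, here is the same regex written out with re.VERBOSE so that each piece is labelled. This is just an annotated restatement for readability, not a change to the extraction above:

poster_pattern = re.compile(r"""
    (
      (i\'m|i|my|me)            # word that typically precedes the poster's details
      \s?                       # optional space
      (\[|\()                   # opening square bracket or parenthesis
      (m|f)?                    # sex occasionally appears before the age
      (\s|/)?                   # optional space or slash
      [0-9]{1,2}                # one- or two-digit age
      (\s|/)?                   # optional space or slash
      ([m,f]|male|female)?      # sex (the class also happens to allow a comma)
      (\]|\))                   # closing square bracket or parenthesis
    )
    """, flags=re.IGNORECASE | re.VERBOSE)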
Let’s now clean this up by getting rid of the starting word, then pulling the age and sex out into separate variables and adding them to the DataFrame.
poster_age_sex = poster_age_sex.str.replace(r"((i\'m|i|my|me))\s?", "", flags=re.IGNORECASE, regex=True)
poster_age = poster_age_sex.str.extract(r'([0-9]{1,2})', expand=False)
poster_sex = poster_age_sex.str.extract(r'([m,f])', flags=re.IGNORECASE, expand=False)
rel_df['PosterAge'] = pd.to_numeric(poster_age)
rel_df['PosterSex'] = poster_sex.str.upper()
rel_df[:5]
| | Date | Title | Flair | Comments | Score | PosterAge | PosterSex |
|---|---|---|---|---|---|---|---|
0 | 2015-08-21 20:07:02 | [UPDATE]My [26 F] with my husband [29 M] 1 yea… | Updates | 908 | 7843 | 26 | F |
1 | 2015-08-07 15:50:46 | Update: I [30 F] am sitting in the back of my … | Updates | 631 | 6038 | 30 | F |
2 | 2015-06-23 02:20:34 | UPDATE: My fiancee (24F) has no bridesmaids an… | Updates | 623 | 5548 | 25 | M |
3 | 2015-08-01 01:38:10 | My [42M] daughter [17F] has been bullying a gi… | NaN | 970 | 5301 | 42 | M |
4 | 2015-08-25 22:51:57 | [Update] My [26F] fiance’s [28M] ex-wife [28F]… | Updates | 767 | 5195 | 26 | F |
Let’s now check for missing values in our new PosterAge and PosterSex variables:
rel_df['PosterAge'].isnull().sum()
91
rel_df['PosterSex'].isnull().sum()
103
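To get a feel for which titles the pattern is missing, it can help to peek at a few of them (a quick sketch using the columns we just created):

# A handful of titles where no poster sex could be extracted
rel_df.loc[rel_df['PosterSex'].isnull(), 'Title'][:5]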
Again, these variables have fairly high amounts of missing data (9% for age, 10% for sex). This is another possible source of bias to keep in mind. I’ll discuss how these possible biases might affect the interpretation of our analyses at the end of next week’s post.
Finally, we can use the date variable to obtain the day of the week that the post was created:
rel_df['DayOfWeek'] = rel_df['Date'].dt.dayofweek
days = {0: 'Mon', 1: 'Tues', 2: 'Weds', 3: 'Thurs', 4: 'Fri',
5: 'Sat', 6: 'Sun'}
rel_df['DayOfWeek'] = rel_df['DayOfWeek'].map(days)
rel_df[:5]
| | Date | Title | Flair | Comments | Score | PosterAge | PosterSex | DayOfWeek |
|---|---|---|---|---|---|---|---|---|
0 | 2015-08-21 20:07:02 | [UPDATE]My [26 F] with my husband [29 M] 1 yea… | Updates | 908 | 7843 | 26 | F | Fri |
1 | 2015-08-07 15:50:46 | Update: I [30 F] am sitting in the back of my … | Updates | 631 | 6038 | 30 | F | Fri |
2 | 2015-06-23 02:20:34 | UPDATE: My fiancee (24F) has no bridesmaids an… | Updates | 623 | 5548 | 25 | M | Tues |
3 | 2015-08-01 01:38:10 | My [42M] daughter [17F] has been bullying a gi… | NaN | 970 | 5301 | 42 | M | Sat |
4 | 2015-08-25 22:51:57 | [Update] My [26F] fiance’s [28M] ex-wife [28F]… | Updates | 767 | 5195 | 26 | F | Tues |
Checking for missing data, we find that this time no values are missing (every post has a date, so every post gets a day of the week).
rel_df['DayOfWeek'].isnull().sum()
0
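As an aside, more recent versions of pandas can generate day names directly, which would let you skip the dictionary lookup above; a minimal sketch (note that this returns full names such as 'Monday' rather than the abbreviations used here):

# Requires a reasonably recent version of pandas
rel_df['Date'].dt.day_name()[:5]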
Descriptives
Now that we have finished cleaning, we are ready to get acquainted with our data using some descriptive statistics. For the sake of brevity I will skip screening for normality and assume all of the continuous variables are non-normal, but in a real analysis it would obviously be necessary to explore this properly. As such, I will use the median and interquartile range (IQR) for continuous variables, and frequencies and percentages for categorical variables.
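If you did want to run a quick normality screen yourself, a simple starting point (just a sketch, assuming matplotlib is available) would be to eyeball a histogram and check the skewness of each continuous variable:

import matplotlib.pyplot as plt

# Eyeball the distribution of one of the continuous variables and check its skewness
rel_df['Score'].plot(kind='hist', bins=30)
plt.show()
rel_df['Score'].skew()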
Poster age
Let’s start with PosterAge. We can see we have 909 data points for this variable, and that posters are a median of 26 years old (IQR: 23, 29).
rel_df['PosterAge'].describe()
count 909.000000
mean 26.698570
std 6.323285
min 13.000000
25% 23.000000
50% 26.000000
75% 29.000000
max 57.000000
Name: PosterAge, dtype: float64
Poster sex
Looking at PosterSex, we can see that we have 897 data points for this variable. 542 of the posters are female (60% of non-missing values), and 355 are male (40%).
rel_df['PosterSex'].notnull().sum()
897
rel_df['PosterSex'].value_counts()
F 542
M 355
Name: PosterSex, dtype: int64
100 * rel_df['PosterSex'].value_counts() / rel_df['PosterSex'].notnull().sum()
F 60.423634
M 39.576366
Name: PosterSex, dtype: float64
Flairs
Turning to the Flair variable, we have 904 complete data points. The most common flair is “Updates” (516 posts, or 57%), and the least common is “Dating” (3 posts, or less than 1%). The bottom three categories are concerningly small and are therefore unlikely to be suitable for further analysis, especially when we get to subgroup analyses next week (see this previous blog post for a discussion of the importance of sufficiently large samples).
rel_df['Flair'].notnull().sum()
904
rel_df['Flair'].value_counts()
Updates 516
Relationships 161
Non-Romantic 158
Infidelity 38
Breakups 15
Personal issues 13
Dating 3
Name: Flair, dtype: int64
100 * rel_df['Flair'].value_counts() / rel_df['Flair'].notnull().sum()
Updates 57.079646
Relationships 17.809735
Non-Romantic 17.477876
Infidelity 4.203540
Breakups 1.659292
Personal issues 1.438053
Dating 0.331858
Name: Flair, dtype: float64
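If you prefer to see the counts and percentages side by side, one option (a small sketch built from the two calls above) is to concatenate them into a single table:

flair_counts = rel_df['Flair'].value_counts()
flair_pct = 100 * flair_counts / rel_df['Flair'].notnull().sum()
pd.concat([flair_counts, flair_pct], axis=1, keys=['Count', 'Percent'])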
Score
Examining the Score variable, we can see that it has all 1,000 data points, and that the median score per post is 1,225 (IQR: 963, 1,762).
rel_df['Score'].describe()
count 1000.00000
mean 1511.58000
std 822.78436
min 792.00000
25% 963.00000
50% 1224.50000
75% 1762.00000
max 7843.00000
Name: Score, dtype: float64
Comments
Similarly, the Comments variable has all 1,000 data points. The median number of comments per post is 269 (IQR: 161, 421).
rel_df['Comments'].describe()
count 1000.000000
mean 318.964000
std 219.461632
min 15.000000
25% 161.000000
50% 269.000000
75% 421.250000
max 1693.000000
Name: Comments, dtype: float64
Day of week
Finally, let’s have a look at DayOfWeek. We already know it has all 1,000 data points, so we don’t need to check that again. We can see that most of the popular posts were created during the week, with around 15% of posts on each of the weekdays. In contrast, Saturday was the quietest day for popular posts.
rel_df['DayOfWeek'].value_counts()
Tues 156
Weds 155
Mon 155
Thurs 154
Fri 148
Sun 121
Sat 111
Name: DayOfWeek, dtype: int64
100 * rel_df['DayOfWeek'].value_counts() / rel_df['DayOfWeek'].notnull().sum()
Tues 15.6
Weds 15.5
Mon 15.5
Thurs 15.4
Fri 14.8
Sun 12.1
Sat 11.1
Name: DayOfWeek, dtype: float64
We now have a cleaned dataset and have inspected each of the variables (although be aware that I took some shortcuts with my screening and didn’t inspect things like normality). Next week we will be ready to run some analyses.
For those who have followed this tutorial so far, or have been reading my blog more generally, a huge thank you! Today marks just over three months of blogging for me, and it has been wonderful to have an excuse to constantly learn new data science skills and to share them with others. I hope my posts have helped you learn something, just as I have learned, and continue to learn, from all of the wonderful data science and programming bloggers out there.