The Data

Part 1 of 3: Data cleaning & pre-processing

*This post assumes some familiarity with Python

In an attempt to address these questions, we decided to build a machine learning model to analyze whether seeking treatment for a mental health issue is more contingent on company characteristics or on individual characteristics. This series will provide an overview of our project, and this first post will focus on data cleaning & pre-processing. If you’re here to learn about our findings and would like to follow along, feel free to do so using the dataset linked below! (I recommend Jupyter Notebook or Deepnote.)

My hope is that by the end of this three-part series, you will be able to understand how we came to our conclusions and even build your own prediction model to address questions you’re curious about!

First things first, of course: the data. To approach these questions, we are using mental health data from the survey run by Open Sourcing Mental Illness (OSMI) in 2016. Let’s start by importing everything we need for this section.

Imports
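A minimal version of this cell might look like the snippet below. The CSV file name is just a placeholder (point read_csv() at wherever you saved the OSMI dataset), and df is the name I’ll use for the DataFrame from here on.

```python
# Minimal imports for this section
import pandas as pd

# Load the OSMI 2016 survey responses
# (the file name is a placeholder; use the path of the CSV you downloaded)
df = pd.read_csv("osmi-mental-health-2016.csv")
```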

To view the first five rows of your data rendered as a table, you can insert the following:
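Assuming the DataFrame is named df, as in the imports above:

```python
# Render the first five rows of the survey as a table
df.head()
```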

This will give you something similar to the table below, and you should be able to scroll horizontally to view all columns.

To get a better look at all of the questions asked in the survey, insert this into the next cell:
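Again assuming the df from above, one way to do this is:

```python
# List every column name, i.e. every question asked in the survey
list(df.columns)
```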

This will give you a list of the DataFrame’s columns. It might look something like this:

As mentioned above, the question we want to dive deeper into is whether or not someone has sought treatment for a mental health issue. Fortunately, the answers to that question are already in the form of binary data. For an overview of how many people answered no and how many answered yes, we can call the groupby() function, which splits the data into groups based on the unique answers. In our case, respondents who answered 0 have not sought treatment, while those who answered 1 have.
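A sketch of that call is below. Note that "treatment" is an assumed column label here; substitute whatever your treatment column ends up being called after renaming.

```python
# Split the data by the unique answers (0 / 1) and count each group
# ("treatment" is an assumed column label; use your own)
df.groupby("treatment").size()
```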

The output should look something like this:

839 of the respondents stated that they have received treatment for a mental health issue, while 594 have not.

Now for the fun part. A large (I cannot emphasize enough how large) majority of machine learning is data cleaning and pre-processing.

For ease in working with the data later, the columns will be renamed to shorter, more succinct labels.
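In pandas this is just a dictionary passed to rename(). The two mappings below are purely illustrative (the survey’s question text is long, and the exact wording in your file may differ), and the real dictionary would cover every question you plan to keep.

```python
# Map a couple of the long survey questions to short labels
# (illustrative only; extend this dictionary to every column you keep)
rename_map = {
    "What is your age?": "Age",
    "What US state or territory do you work in?": "State",
}
df = df.rename(columns=rename_map)
```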

First, we’re going to drop anyone who is self-employed, since we are particularly interested in addressing mental health in tech companies.
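A sketch of that filter, assuming the self-employment question has been renamed to "Self Employed" and is coded 0/1 (both assumptions on my part):

```python
# Keep only respondents who are not self-employed
# ("Self Employed" is an assumed label, coded 0 = no, 1 = yes)
df = df[df["Self Employed"] == 0]
```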

Next, we’re going to drop some questions that we don’t believe add substantial information. We’ve decided to drop “Country Live” and “State Live” since we only care about where each respondent works, and we will also drop the “why not” columns, as those cannot be encoded. We will also drop “MH Interference Treatment” and “MH Interference No Treatment” because our group is interested in whether or not mental health issues are more prevalent in certain demographics, and less concerned with treatment itself. Finally, since there is a lot of missing data, we will drop any column that is more than 50% empty.

To check for columns that are more than half empty, I looped through every column in the DataFrame and checked to see if the number of null values in each column was greater than half the size of the DataFrame.
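Here’s roughly what that loop looks like:

```python
# Collect every column whose null count exceeds half the number of rows
mostly_empty = []
for col in df.columns:
    if df[col].isnull().sum() > len(df) / 2:
        mostly_empty.append(col)

print(mostly_empty)
```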

This list represents all columns with more than 50% null values. Knowing this, we can now manually drop these columns, in addition to the questions we discussed above.
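The drop itself is a single call. The labels below are the shortened names from the renaming step, so treat them as placeholders for whatever your columns are actually called; the free-text "why not" columns would be added to the same list.

```python
# Drop the mostly-empty columns plus the questions we decided not to use
# (column labels are placeholders; the "why not" columns belong here too)
cols_to_drop = mostly_empty + [
    "Country Live",
    "State Live",
    "MH Interference Treatment",
    "MH Interference No Treatment",
]
df = df.drop(columns=cols_to_drop)
```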

Dealing with missing or seemingly incorrect data depends on the type of data you’re working with. If you’re dealing with categorical data, for example, you might want to fill missing data with the mode.

To start off, you can insert the line of code below to view a table showing you how many NaN values are in each column:
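For example:

```python
# Show how many NaN values remain in each column
df.isnull().sum()
```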

Let’s take a look at the “State” column and see if there are many missing values. To view the count of NaN values for this specific column, you can use the isnull() and sum() functions.
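For just that one column:

```python
# Count the missing values in the "State" column specifically
df["State"].isnull().sum()
```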

429 missing values is an alarming number. When taking a closer look at the data, we came to the conclusion that those who left this question blank likely live outside the United States. We can then fill all NaN values with 0 to indicate non-US respondents. (This number will make more sense when we get to the next section!)
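The fill itself is a one-liner:

```python
# Encode missing states as 0, i.e. respondents working outside the US
df["State"] = df["State"].fillna(0)
```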

To address another example, let’s take a closer look at the “Age” column. In our case, we found that some individuals might have inputted incorrect data. When checking for extreme values in the Age column, we found that the minimum value was 3 and the maximum value was 323. Making the assumption that these were mistakes (I sure hope no 3-year-old is working full-time, and I would think the existence of a 323-year-old would be more recognized), we can calculate the mean age and replace the extreme ages with that value.
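One way to sketch this is below. The 18 and 75 cutoffs are my own assumption about what counts as an "extreme" age, so adjust them as you see fit.

```python
# Treat ages outside a plausible working range as data-entry mistakes and
# replace them (and any missing ages) with the mean of the remaining ages.
# The 18 to 75 cutoffs are an assumption, not part of the original survey.
plausible = (df["Age"] >= 18) & (df["Age"] <= 75)
mean_age = df.loc[plausible, "Age"].mean()
df.loc[~plausible, "Age"] = mean_age
```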

Finally, we will be addressing a very common obstacle in data cleaning: categorical data. Machine learning algorithms expect numerical input data, so in order to avoid leaving categorical information out, we can create dummy variables to convert the data into numeric values.

Let’s think back to the “State” column. The data consists of string values (e.g. “Maine”, “New Hampshire”). Recall that we have already filled all NaN values with 0 in the previous step. What we can do now is split up the states into 4 main regions: Northeast, Midwest, South, and West. Once split up, we can then assign each group to a number. Let’s write a function for that.
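Here’s a sketch of such a function. The state-to-region groupings follow the US Census Bureau’s four regions, and the numeric codes 1 through 4 are arbitrary labels chosen for illustration; 0 stays reserved for non-US respondents.

```python
# US Census Bureau regions; the numeric codes below are arbitrary labels
NORTHEAST = {"Maine", "New Hampshire", "Vermont", "Massachusetts",
             "Rhode Island", "Connecticut", "New York", "New Jersey",
             "Pennsylvania"}
MIDWEST = {"Ohio", "Indiana", "Illinois", "Michigan", "Wisconsin",
           "Minnesota", "Iowa", "Missouri", "North Dakota", "South Dakota",
           "Nebraska", "Kansas"}
SOUTH = {"Delaware", "Maryland", "Virginia", "West Virginia", "Kentucky",
         "Tennessee", "North Carolina", "South Carolina", "Georgia",
         "Florida", "Alabama", "Mississippi", "Arkansas", "Louisiana",
         "Oklahoma", "Texas", "District of Columbia"}
WEST = {"Montana", "Idaho", "Wyoming", "Colorado", "New Mexico", "Arizona",
        "Utah", "Nevada", "Washington", "Oregon", "California", "Alaska",
        "Hawaii"}

def encode_region(state):
    """Map a US state name to a numbered region; anything else stays 0."""
    if state in NORTHEAST:
        return 1
    if state in MIDWEST:
        return 2
    if state in SOUTH:
        return 3
    if state in WEST:
        return 4
    return 0  # non-US respondents (already filled with 0)
```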

We can then use the apply() function to apply this to every row in the “State” column:
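With the function defined:

```python
# Replace each state name with its numeric region code
df["State"] = df["State"].apply(encode_region)
```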

Now, if we take a look at the values in the “State” column, they should all be encoded and appropriately categorized!
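A quick value_counts() confirms it:

```python
# Count respondents per encoded region (0 = outside the US)
df["State"].value_counts()
```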

We have 429 respondents outside the US, 221 respondents in the Midwest, 218 in the West, 167 in the South, and 108 in the Northeast. Now the algorithm can understand this data the way we want it to! Woohoo!

This concludes part 1. Congratulations — you have overcome data pre-processing, an absolutely vital portion of the machine learning journey! Look out for part 2 during which we will explore cross-validation and finally build our actual machine learning model!
