Bruno Arine

5 Traits that Predict Substance Abuse

Substance abuse hurts. Not only those who are close, but society at large. In 2017, nearly 20 million American adults (almost 7% of the population) battled a substance abuse problem, costing the US more than $740 billion annually in lost workplace productivity, healthcare expenses, and crime-related costs, according to the American Addiction Centers.

Substance abuse has a cure, but many people are too ashamed to reach out for help, which makes the problem even harder to address.

What if we could tell whether someone has a potential problem with drugs without asking them explicitly? Maybe we could look out for common traits among people who suffer from substance abuse problems and offer help in advance. Organizations could target ad campaigns at the right people at the right moment and increase their effectiveness.

But what would these common traits be?

Looking for a pattern

This study will try to answer the following questions:

  1. Which age group shows the largest percentage of people with a possible substance addiction problem? And does substance abuse disproportionately affect any gender?

  2. Which kind of profession is more vulnerable to substance abuse?

  3. Are people with kids or partners less likely to have a substance abuse problem?

  4. Are religious people less likely to have a substance abuse problem?

  5. Does making healthy choices (e.g., not smoking or eating plant-based diets) correlate negatively with substance abuse?

And last but not least, how accurately can we predict substance abuse with just the above criteria?

To tackle these questions, I’m going to use the OkCupid Profile Dataset, a file containing answers from 60k users covering a number of topics, from marital status, job, and kids to alcohol and drug consumption.

Just perfect for my purpose.

The dataset is anonymized and free for general use, with permission from OkCupid. (There are larger datasets on the internet, all of which were illegally scraped. Please avoid using these. Ethics, people.)

For more information on how I treated and analyzed the data, please check out the Jupyter notebook in my GitHub repository.

Defining “substance abuse”

First, we need to define “substance abuse”. Here’s a reasonable definition from Wikipedia:

Substance abuse, also known as drug abuse, is the use of a drug in amounts or by methods that are harmful to the individual or others. (…) Drugs most often associated with this term include: alcohol, amphetamines, barbiturates, benzodiazepines, cannabis, cocaine, hallucinogens, methaqualone, and opioids.

The OkCupid dataset isn’t precise in terms of which substances users consume, but it gives us a few clues about quantity:

  • drinks has six possible values: very often, often, socially, rarely, desperately, not at all
  • drugs has three possible values: never, sometimes, often

I’m going to assume that users who use drugs often or drink very often are likely candidates for a substance problem.

I didn’t include entries where users answered “desperately” because this sounds more like something cool to signal on a dating site than a sincere answer.
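As a rough illustration, the labeling rule described above could be expressed in pandas as follows. This is a minimal sketch, not the notebook's actual code: the `profiles` frame holds made-up rows, the `substance_abuse` column name is hypothetical, and treating a missing answer as "not flagged" is an assumption. Only the `drinks` and `drugs` column names and their answer values come from the dataset description.

```python
import pandas as pd

# Toy rows standing in for the real profiles (hypothetical data).
profiles = pd.DataFrame({
    "drinks": ["socially", "very often", "desperately", "rarely", "often", None],
    "drugs":  ["never", "sometimes", "never", "often", "never", "never"],
})

# Positive label: uses drugs "often" OR drinks "very often".
# "desperately" is deliberately not counted, and a missing answer
# compares as False, so it is treated as "not flagged" (an assumption).
profiles["substance_abuse"] = (
    profiles["drugs"].eq("often") | profiles["drinks"].eq("very often")
)

print(profiles["substance_abuse"].tolist())
print(f"share flagged: {profiles['substance_abuse'].mean():.1%}")
```

Because the label is a plain boolean column, its mean is directly the share of flagged users, which is how a figure like the 3% below can be read off.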

Fact: After some data wrangling and nailing down the substance abuse definition, the dataset shows that 3% of OkCupid users have a potential substance abuse problem in contrast with 7% of the total population.

Which gender and age group has the most significant percentage of users with a possible substance abuse problem?

When it comes to age, there’s a clear downward trend in substance abuse, so that the probability of finding anyone binge drinking or using drugs is nearly zero among the older groups (at least on OkCupid).

Maybe this occurs because older people are less prone to take risks? Our data is too limited to go beyond speculation at this point. Suffice it to say that age is a good predictor of substance abuse, so prevention and treatment efforts should be directed at young adults.
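For readers who want to reproduce this kind of breakdown, here is a small sketch of computing the flagged share per age group with pandas. The data is invented, the `substance_abuse` column name is hypothetical, and the bin edges (in particular the youngest bin) are my assumption; only the "30 to 39" and "40+" group names appear in this article.

```python
import pandas as pd

# Hypothetical toy data: age in years and a boolean flag per user.
toy = pd.DataFrame({
    "age": [19, 22, 24, 27, 31, 35, 38, 42, 55, 61],
    "substance_abuse": [True, False, True, False, False,
                        True, False, False, False, False],
})

# Assumed bin edges; right=False makes each bin include its lower edge,
# so "30 to 39" covers ages 30 through 39.
bins = [18, 30, 40, 120]
labels = ["18 to 29", "30 to 39", "40+"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# The mean of a boolean column is the share of True values in each group.
rate_by_age = toy.groupby("age_group", observed=True)["substance_abuse"].mean()
print(rate_by_age)
```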

There’s also a significant difference between genders, but only in the 30 to 39 and 40+ age groups (p < 0.05). However, the data can’t tell us why. Maybe women are less likely to develop a substance abuse problem because they are less exposed to drugs throughout their lives, or because they are more resilient, or less irresponsible—or all the explanations together.
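The p-values quoted in this article come from the notebook, and the text here does not say which test produced them. As one plausible way to compare two groups within an age bracket, the sketch below runs a two-sided two-proportion z-test using only the standard library. The function name and all counts are hypothetical.

```python
import math

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts for one age group (not taken from the dataset):
# 60 flagged out of 2,000 in group A vs. 30 out of 1,800 in group B.
z, p = two_proportion_z_test(hits_a=60, n_a=2000, hits_b=30, n_b=1800)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A test like this only says whether the gap is larger than chance would explain. It says nothing about why the gap exists.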

Which profession is more vulnerable to substance abuse?

The graph above shows that the unemployed category is at a higher risk of developing a substance abuse problem than any other category. The likelihood of having a problem is also higher among those who would “rather not say” anything about their employment status, maybe because they are unemployed and too embarrassed to say so.

Are people with kids or a partner less likely to have a substance abuse problem?

Parenting correlates strongly with age, so it makes sense to break down offspring data by user age groups here. I’ve observed that users with kids are way more likely to have substance problems in their early adulthood (p < 0.01), but this effect decreases (and even reverses) as they get older (p < 0.05).

And how do relationships correlate with substance abuse?

The graph above suggests that people with stable relationships are less likely to have a substance abuse problem. But remember, we can’t assume that the kind of relationship is what steers people into addiction. It’s also important to keep in mind that relationships correlate with age as well. Older adults could be less prone to romantic adventures (or so it goes), and if age (or maturity) is a major factor behind substance abuse, people in open relationships will spuriously correlate with substance problems.

We have just discussed human relationships. But what about animal bonds?

As a cat owner myself, I slowly nod in agreement at this chart. Cat ownership seems to correlate with substance abuse, but we can’t tell the reason from this data alone. Maybe dogs require going on walks and cats don’t, so dog owners could be in better health. Or cats are lower maintenance, so you can juggle a substance abuse issue with a cat. More data is definitely needed in this regard.

Does taking religion seriously help prevent substance abuse?

When asked about their religion, OkCupid users could pick from a pre-defined list and add a modifier (“laughing about it”, “somewhat serious about it”, etc.). I decided to focus on how taking religion seriously (regardless of which religion) affected the likelihood of belonging to the substance abuse group.

Much to my surprise, religious people are among the likeliest to have a substance abuse problem in all age groups (p < 0.01). But a word of caution before jumping to conclusions: we can’t tell cause from effect just by staring at correlations. Does hardcore religiosity lure people into alcohol and drug binging, or do people with substance problems take religion seriously because they seek solace in it?

Do healthy choices (e.g., not smoking and plant-based diets) correlate negatively with substance abuse?

The logic behind this hypothesis is that, maybe, and just maybe, people with healthy lifestyles are less likely to have a substance problem.

This is not what the above graph shows, however.

Vegetarianism didn’t correlate with substance abuse at all. On the other hand…

The graph suggests that smoking is an ominous sign of potential substance abuse problems, and the association holds in all age groups (p < 0.0001).

Prediction from data

To see if I can predict substance abuse using just the mentioned answers, I used a Logistic Regression model (see my Jupyter notebook for the technical details on how I dealt with class imbalance and variable scaling). Besides the elegant simplicity of the technique, it can also show which variables had the most impact on the model’s predictions.
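To make the idea concrete, here is a minimal scikit-learn sketch on synthetic data. It is not the notebook's code: the feature names are hypothetical stand-ins, and using `StandardScaler` for scaling and `class_weight="balanced"` for the imbalance are my assumptions about one reasonable setup. The point is only to show how the fitted coefficients can be read as the direction and relative strength of each variable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for three of the answers; not the OkCupid data.
age = rng.integers(18, 70, size=n)
smokes = rng.integers(0, 2, size=n)
has_dog = rng.integers(0, 2, size=n)
X = np.column_stack([age, smokes, has_dog])
feature_names = ["age", "smokes", "has_dog"]

# Rare positive class: by construction more likely for younger users
# and smokers, and unrelated to dog ownership.
logit = -2.0 - 0.06 * (age - 18) + 1.5 * smokes
y = rng.random(n) < 1 / (1 + np.exp(-logit))

# Scale the features, then fit a logistic regression that reweights
# the classes so the rare positives are not drowned out.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced"),
)
model.fit(X, y)

# After scaling, coefficient signs and magnitudes are comparable.
coefs = model.named_steps["logisticregression"].coef_[0]
for name, coef in zip(feature_names, coefs):
    print(f"{name:>8}: {coef:+.2f}")
```

A positive coefficient pushes the prediction toward the substance abuse class and a negative one pushes it away. A coefficient near zero means the variable adds little once the others are accounted for.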

According to the model, substance abuse is positively correlated with unemployment, religiosity, owning a cat (right?), and smoking; marriage, being female, and being older correlate negatively with substance abuse. Vegetarianism, having a dog, and having children show no significant correlation.

To test the model, I set aside 20% of the data before training to use as my validation set.
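A hold-out split like this can be sketched with scikit-learn's `train_test_split`. The data below is made up, and passing `stratify=y` (so that the rare positive class keeps the same share in both parts) is my assumption, not something this article confirms.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 rows with 3% positives.
X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000, dtype=int)
y[:30] = 1

# Hold out 20% for validation, keeping the class ratio in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_val))
print(y_train.mean(), y_val.mean())
```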

The results were promising:

Accuracy: 0.778
Recall: 0.825

This means that the model made the correct prediction for 77.8% of the users in the validation set, and that it identified 82.5% of the actual positive cases.
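To see how the two numbers differ, here is the arithmetic on a hypothetical confusion matrix. The counts are invented for a 1,000-user validation set and chosen only so that the formulas land on the figures above. They are not my actual results, which live in the notebook.

```python
# Hypothetical confusion-matrix counts (NOT the real results).
tp, fn = 33, 7      # actual positives: caught vs. missed
fp, tn = 215, 745   # actual negatives: false alarms vs. correctly cleared

# Accuracy: share of all predictions that were right.
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Recall: share of the actual positives that the model caught.
recall = tp / (tp + fn)
print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

Accuracy is computed over everyone, while recall looks only at the users who actually belong to the substance abuse group. With so few positives, the two can tell quite different stories.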

Not bad for such a simple model and small dataset! See my GitHub repository for more details.

Acknowledgements

Thanks to Albert Y. Kim and Adriana Escobedo-Land for providing the OkCupid dataset on GitHub, and to Iris Rohr for the immense patience in proofreading my draft.