Terms of Reference :: What is Data Science?

Before I start delving too deeply into some of these topics and some of my ideas for using them to upgrade your human resources programs (among other things, but people analytics is my favorite subject), I want to lay down some foundations and make sure we're all talking about the same thing. This is also a good chance for those of you who are experts in the field to help educate! Today, let's start with data science.

What IS data science?

So what IS data science? No matter how you search the term, you get some variation on the following: a multidisciplinary blend of concepts from computer science, statistics, mathematical modeling, data inference, technology, and a variety of additional subjects, all of it used to solve complex problems involving large amounts of data.

Naturally, the rock upon which we build this skill set is data. Tons and tons and tons of data. The raw information that nowadays comes streaming in from various collection sources to be warehoused, examined, mined, and analyzed using a large variety of techniques. Data science is all about finding the best ways to use all of this data to uncover insights and help decision makers make smarter, data-driven decisions.

So how do we do this?

I can only speak from my own experience, but I attribute any of the very minor success I've had in the data realm to my affinity for detective novels. I grew up on Sherlock Holmes and expanded to a multitude of thriller novels devoted to solving complex crimes. To be a data scientist, you have to be a bit of a detective: you explore the evidence to uncover the story, investigate leads, and test hypotheses to discover trends, patterns, and connections within the data.

Analytical techniques and fancy tools aside, what we’re all really doing is getting a deeper understanding of what has happened within our data and what it can tell us about what will happen in the future, and how to shape that future.

And that’s a pretty damn cool thing, if you ask me.

Types of Data

Categorical Data: Categorical data is the type of data that lets you divide sets into groups, e.g. race, sex, age group, employment status, marital status, etc. It can take on numerical values, but those numbers serve only as group labels, not as quantities you can do arithmetic on.

Numerical Data: Numerical data is, well, numbers. It’s everything that’s measurable, such as time, height, distance, amount, weight, and so forth. Can you average it or run other types of statistical analysis on it? It’s numerical.

Nominal Data: Nominal data is made of variables used to name or label things. It's a subtype of categorical data that can't be ordered or measured. Even when nominal data looks numeric, the numbers don't carry any numerical value – they're just labels (e.g. a social security number).

Ordinal Data: Ordinal data is data that provides an order of choices. A good example of ordinal data is the Netflix 5-star rating scheme or the ordered choices in a customer satisfaction survey.
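To make these scale types concrete, here's a minimal sketch in plain Python with made-up, HR-flavored values: numerical data can be averaged, nominal labels can only be matched or counted, and ordinal labels can be sorted once you define their order.

```python
import statistics

# Numerical data: measurable values you can average.
heights_cm = [158, 172, 165, 180, 169]
mean_height = statistics.mean(heights_cm)  # 168.8

# Nominal data: labels with no order -- any numbers here are just IDs.
employee_ids = ["E-104", "E-221", "E-007"]

# Ordinal data: labels with a meaningful order but no fixed spacing.
satisfaction_order = ["poor", "fair", "good", "excellent"]
responses = ["good", "poor", "excellent", "good"]
rank = {label: i for i, label in enumerate(satisfaction_order)}
sorted_responses = sorted(responses, key=rank.__getitem__)
# -> ['poor', 'good', 'good', 'excellent']
```

Note that averaging the employee IDs or the satisfaction labels would be meaningless; that's exactly the distinction these types capture.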

Structured Data: Structured data has a high degree of organization, is easily searchable by simple, straightforward algorithms, and is pretty easy to process. Of course, this is never the stuff we get called upon to analyze.

Unstructured Data: Unstructured data lacks organization and structure, is often text-heavy, and is difficult to analyze using traditional programs and algorithms. Much of what data science does is finding patterns in and interpreting unstructured data.

Big Data: How big is big? We debate this a lot. Big data refers to extremely large data sets, but they're not just large: they contain a wide variety of data and grow in volume with great velocity. Those three V's (volume, variety, and velocity) are what separate ordinary large data sets from what we'd call big data.

Types of Data Science Problems

Here are some of the most common things data scientists and data analysts work on.

Anomaly Detection: This is one of the most fun things to do as a data analyst. This is when you dig into the data to find the “hmmm, that’s funny” moments, the rare events, the weird observations, the stuff that makes you scratch your head because it’s dramatically different from the rest of the data. There are lots of techniques to do this, but working on these problems always makes me feel like Sherlock Holmes.
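One of the simplest techniques for finding those "hmmm, that's funny" points is a z-score check: flag anything more than a couple of standard deviations from the mean. A minimal sketch using made-up daily login counts:

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical data: logins per day, with one suspicious spike.
daily_logins = [101, 98, 103, 97, 100, 99, 250, 102]
print(find_anomalies(daily_logins))  # -> [250]
```

Real anomaly detection gets far more sophisticated, but the spirit is the same: define "normal," then go looking for what isn't.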

Clustering: Clustering techniques allow you to group items in a complex data set so that objects in the group are more similar than objects outside the group. Sometimes you just don’t know where to draw the lines between categories in data. Cluster analysis can help you with that.
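As an illustration, here's a toy one-dimensional version of k-means, one of the most common clustering techniques, run on hypothetical employee tenure data. Real work would use a proper library, but the loop shows the idea: assign each point to its nearest center, then move each center to the middle of its group, and repeat.

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D k-means: assign points to the nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centers)

# Hypothetical data: years of tenure -- two natural groups emerge.
tenure_years = [0.5, 1.0, 1.5, 9.0, 10.0, 11.0]
print(kmeans_1d(tenure_years, centers=[0.0, 5.0]))  # -> [1.0, 10.0]
```

The algorithm discovers the "new hires" and "veterans" groups on its own; nobody had to draw that line in advance.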

Association: Association is similar to clustering in that you're finding patterns and relationships in large data sets. Association rules let you estimate the probability of relationships existing between items, even when the items themselves aren't alike (the classic example is market-basket analysis: people who buy X also tend to buy Y).
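The standard way to quantify an association rule is with support (how often an item set appears) and confidence (how often the consequent appears given the antecedent). A minimal sketch over hypothetical job-application "baskets":

```python
# Hypothetical data: which materials each applicant submitted.
baskets = [
    {"resume", "cover_letter"},
    {"resume", "cover_letter", "references"},
    {"resume", "references"},
    {"cover_letter"},
]

def support(itemset):
    """Fraction of baskets containing every item in the set."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """P(consequent appears | antecedent appears)."""
    return support(antecedent | consequent) / support(antecedent)

# Of applicants who sent a resume, how many also sent a cover letter?
print(confidence({"resume"}, {"cover_letter"}))  # 2 of 3 -> ~0.67
```

Algorithms like Apriori scale this same arithmetic up to millions of baskets and item combinations.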

Regression: If you design experiments or do testing of any kind, you have to know regression. Regression techniques let you analyze data and use the existing patterns to predict future behavior in terms of numeric values.
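The simplest case is ordinary least squares on a single variable: fit a line to past data, then read predictions off the line. A sketch using made-up experience-versus-salary numbers:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience vs. salary (in $1000s).
years = [1, 2, 3, 4, 5]
salary = [45, 50, 55, 60, 65]
slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept  # extrapolate to year six -> 70.0
```

Real regression problems involve many variables and messier data, but "find the pattern, then project it forward" is the core move.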

Classification: Classification assigns categories to data in your data set in order to help you more accurately predict future outcomes. Credit scores and detection of spam emails are examples of commonly used classification algorithms.
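A bare-bones illustration is a one-nearest-neighbor classifier: label a new item the same as the most similar example you've already labeled. Here it's applied to a single made-up spam feature, the number of links in an email:

```python
def nearest_neighbor_label(point, examples):
    """Classify a point by copying the label of the closest labeled example."""
    closest = min(examples, key=lambda ex: abs(ex[0] - point))
    return closest[1]

# Hypothetical training data: (link count, label) pairs.
labeled = [(0, "ham"), (1, "ham"), (8, "spam"), (12, "spam")]
print(nearest_neighbor_label(2, labeled))   # -> ham
print(nearest_neighbor_label(9, labeled))   # -> spam
```

Production spam filters and credit-scoring models use far richer features and algorithms, but they answer the same question: which known category does this new observation most resemble?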

Dimension Reduction: We often deal with a large number of variables, so it's useful to have techniques that help us decide which ones matter for a problem and which ones really have no bearing on it. Dimension reduction lets us reduce the number of random variables we have to consider. Feature selection and feature extraction are subsets of this.
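The simplest form of feature selection is dropping variables that barely vary, since a near-constant column can't help distinguish anything. A minimal sketch over hypothetical HR features:

```python
import statistics

def low_variance_features(columns, threshold=0.01):
    """Flag columns whose values barely vary -- candidates to drop."""
    return [name for name, values in columns.items()
            if statistics.pvariance(values) <= threshold]

# Hypothetical HR dataset: three candidate features.
features = {
    "hours_per_week": [38, 40, 45, 50, 42],
    "is_employed": [1, 1, 1, 1, 1],      # constant: carries no information
    "tenure_years": [1.0, 3.5, 2.0, 7.0, 4.5],
}
print(low_variance_features(features))  # -> ['is_employed']
```

Feature extraction methods like PCA go further, combining correlated variables into a smaller set of new ones, but the goal is the same: fewer dimensions, same signal.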

We’ll talk further about what these things mean and I’ll eventually crunch all this down into a cheat-sheet you can download. But in the meantime, what did I miss? What do you have more questions about? What can we delve more deeply into? I’d love to hear what you think!
