How to use Excel in Data Science?

Data Science is one of the most exciting disciplines to emerge in recent times. It combines aspects of Mathematics, Statistics, Computer Science, and Business to solve real-life problems using data.

The Data Science ecosystem includes various tools and technologies for drawing insights from data. Here, we will focus on Microsoft Excel, one of the tools that Data Scientists use widely.

 

How do you use Excel in Data Science?

Let’s start with a simple example to understand its use in data science.

To begin with, let us see how Data Scientists use Excel to share their results. They typically store their analysis results in Excel spreadsheets and make them available to anybody who is interested in seeing them. But these files are usually shared without their original source, the data. Often the reason is that the Data Scientists themselves no longer know where the data came from.

A typical scenario: a team of Data Scientists gets together to identify a business problem and works on solving it using the available data. Each of them uses their own tools (R, Python, Spark, SAS, etc.) to extract data and get some results. The end result is multiple spreadsheets scattered across the team members’ computers.

The situation gets even more complicated when these files have to be archived for future reference or shared with other colleagues. If the Data Scientists themselves don’t know the origin of these files, then how do they expect other people to figure it out? As a result, most of the time, this information is never preserved and the results just get lost somewhere in cyberspace.

 

How can Excel help in such situations?

The answer lies in adding some metadata to each individual spreadsheet. For example, the team can record when and how the data was collected, and which organisation it originated from. Each spreadsheet then carries information that lets us trace it back to its origin. In addition, if you are working in a team environment, recording who collected the data and who did the analysis can be very useful when you need to revisit those results.

This is where Microsoft Excel comes in handy. It lets us attach notes and comments to individual cells, fill in document properties such as author and title, or keep a dedicated metadata sheet in the workbook. Other people, and other tools that read the workbook, can then use this extra information to reconstruct the workflow that led to the results.
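The kind of metadata described above can be sketched in a few lines of Python. This is only an illustration: the field names and every value are hypothetical, and the rows it produces are meant to be pasted or exported into a metadata sheet by whatever means you already use.

```python
import json
from datetime import date


def build_metadata(source, collected_on, collected_by, analysed_by, method):
    """Return provenance details for one spreadsheet as a plain dict."""
    return {
        "source_organisation": source,
        "collected_on": collected_on.isoformat(),
        "collected_by": collected_by,
        "analysed_by": analysed_by,
        "collection_method": method,
    }


# Hypothetical values, purely for illustration.
metadata = build_metadata(
    source="Example Retail Ltd",
    collected_on=date(2023, 1, 31),
    collected_by="analyst_a",
    analysed_by="analyst_b",
    method="monthly export from the sales database",
)

# One [field, value] row per item, ready for a metadata sheet.
metadata_rows = [[field, value] for field, value in metadata.items()]
print(json.dumps(metadata, indent=2))
```

Keeping the record as simple field and value pairs means it can live in the workbook itself and still be read by other tools.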

 

Excel Formulas Every Data Scientist or Analyst Should Know

Data science adds a new dimension to data modeling, along with the need to integrate existing tools for solving business problems. In such an environment you will probably use several tools, Excel among them, so it pays to know what its formulas can do.

Here is a list of important Excel formulas that every data scientist should know. The examples use a semicolon as the argument separator, which is what Excel expects under many European regional settings; with US or UK settings, use a comma instead.

VLOOKUP()

> The VLOOKUP formula searches for a value in the left-most column of a table and returns the value from the same row in a column you specify. Let’s look at a simple example of how we can use VLOOKUP in a data science workflow.

Suppose you have a table that lists the monthly sales (broken down by a dimension such as region) for 3 years in a row and you would like to know the corresponding revenue figure for a given month.

 

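To do this, assume the table occupies A2:D721, with the month in column A and the revenue in column D, and that the month you want to look up is typed into cell F3. This layout is an assumption for the example.

Using VLOOKUP we could write a formula as shown below to find the corresponding revenue.

=VLOOKUP(F3;$A$2:$D$721;4;FALSE)

Here, A2:D721 refers to the range of cells that contains the monthly sales table; it has to span all the columns involved, not just column A. The third argument, 4, tells Excel to return the value from the fourth column of that range (in our case, revenue). The final argument, FALSE, asks for an exact match; without it, VLOOKUP assumes the first column is sorted and may return an approximate match.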
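For readers who also work in Python, the lookup logic can be sketched as follows. The function name, the table layout, and the sample figures are all made up for illustration; this is not how Excel implements VLOOKUP, only a small model of its exact-match behaviour.

```python
def vlookup(lookup_value, table, col_index):
    """Exact-match lookup modelled on VLOOKUP with FALSE as last argument.

    Scans the first column of `table` for `lookup_value` and returns the
    value in column `col_index` (1-based) of the first matching row.
    Returns None where Excel would show #N/A.
    """
    for row in table:
        if row[0] == lookup_value:
            return row[col_index - 1]
    return None


# Hypothetical table: month, region, units sold, revenue.
sales = [
    ("2021-01", "North", 120, 2400.0),
    ("2021-02", "North", 135, 2700.0),
    ("2021-03", "North", 150, 3000.0),
]

print(vlookup("2021-02", sales, 4))  # revenue for February 2021
print(vlookup("2022-01", sales, 4))  # no such month in the table
```

The first call finds the February row and returns its fourth column; the second finds nothing and returns None.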

CONCATENATE

> This function joins several text strings into a single text string. For example, if you need to combine text columns (like First Name + Last Name), you could use the CONCATENATE formula as shown below. The " " in the middle puts a space between the two names; without it they would run together.

=CONCATENATE(A3;" ";B3)

LEFT

> This function returns the leftmost characters of a text string. Suppose cell A2 holds the value 98765 and you only need its first two characters; then you could use the following formula.

=LEFT(A2;2)  Returns the string: 98

RIGHT

> Use this function to return the rightmost characters of a text string. If A2 again holds 98765 and you only need its last two characters, you can write a formula as follows.

=RIGHT(A2;2)  Returns the string: 65

MID

> Use this function to extract part of a text string, given a start position and a number of characters. Suppose A2 holds the string “He is a user” and you would like to extract the words “He is”; these are the 5 characters starting at position 1, so we can use the following formula.

=MID(A2;1;5)  Returns the string: He is
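If it helps to see the same text operations outside Excel, here is a minimal Python sketch using string slicing. The helper names simply mirror the Excel functions and are not part of any library; positions are 1-based, as in Excel.

```python
def left(text, n):
    """First n characters, like LEFT(text; n)."""
    return text[:n]


def right(text, n):
    """Last n characters, like RIGHT(text; n)."""
    return text[-n:] if n > 0 else ""


def mid(text, start, n):
    """n characters beginning at 1-based position start, like MID."""
    return text[start - 1:start - 1 + n]


print(left("98765", 2))
print(right("98765", 2))
print(mid("He is a user", 1, 5))
```

Note that Excel would treat 98765 as a number and convert it to text first; the sketch starts from a string to keep things simple.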

SUMIF and SUMIFS

> The SUMIF formula adds together all the cells that meet a certain criterion. For example, you can use it in a data science project to get the total sales for each product category.

Suppose your table contains monthly figures for different categories, with the amount in column B and the category in column C, and you would like to know how much money was spent on sales promotions (i.e. marketing expenses). Using SUMIF you can write the following formula.

=SUMIF(C2:C6;"Marketing Expenses";B2:B6)

This returns the total amount spent on marketing, i.e. the sum of the figures in column B for the rows where column C contains Marketing Expenses. Note that text criteria must be wrapped in straight double quotes.

If you have several conditions, use SUMIFS instead. It takes the range to sum first, followed by pairs of criteria range and criterion, and a row is included only if it meets all of the conditions (a logical AND). The example below assumes, for illustration, that column D holds the month.

=SUMIFS(B2:B6;C2:C6;"Marketing Expenses";D2:D6;"Jan")
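The conditional sum can be sketched in Python as below. The `sumifs` helper, the field names, and the ledger rows are all invented for this example; the point is only to show that every condition must hold for a row to be counted.

```python
def sumifs(rows, sum_key, **criteria):
    """Sum rows[sum_key] over the rows that satisfy every criterion.

    With a single criterion this behaves like SUMIF; with several it
    behaves like SUMIFS, where all conditions must hold.
    """
    return sum(
        row[sum_key]
        for row in rows
        if all(row[key] == wanted for key, wanted in criteria.items())
    )


# Hypothetical ledger rows.
ledger = [
    {"amount": 500, "category": "Marketing Expenses", "month": "Jan"},
    {"amount": 300, "category": "Travel", "month": "Jan"},
    {"amount": 200, "category": "Marketing Expenses", "month": "Feb"},
    {"amount": 150, "category": "Marketing Expenses", "month": "Jan"},
]

marketing_total = sumifs(ledger, "amount", category="Marketing Expenses")
marketing_jan = sumifs(
    ledger, "amount", category="Marketing Expenses", month="Jan"
)
print(marketing_total, marketing_jan)
```

Adding the month criterion narrows the total from all marketing rows to the January ones only.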

LEN

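> This function returns the number of characters in a given cell, including spaces. Combined with other functions, it can also be used to count words.

For example, if you need the total number of characters across all the titles in A2:A7, you could write a formula as shown below. SUMPRODUCT is used so that the formula works without being entered as an array formula.

=SUMPRODUCT(LEN(A2:A7))

To count the words in a single non-empty cell, compare its length with and without spaces: =LEN(TRIM(A2))-LEN(SUBSTITUTE(TRIM(A2);" ";""))+1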

LOWER() and UPPER()

> These functions return the text of a cell converted to lower or upper case respectively. Here is an example of using the LOWER formula.

=LOWER(A2)

Here is another example where we use the UPPER function to normalise a letter grade entered by a teacher, so that “b” and “B” are both stored as “B”:

=UPPER(D4)

PROPER()

> This function returns the cell contents in proper case, i.e. with the first letter of each word capitalised and the rest in lower case. For example, if you have a column containing the name of each employee, this formula converts the names to a consistent format. Consistent capitalisation helps later when those names are matched in SQL queries or grouped in pivot tables and reports. The formula for this is given below.

=PROPER(A2)

TRIM()

> This function strips spaces from both ends of the cell’s contents and reduces runs of spaces between words to a single space. You can use it in data science projects when cleaning up your input data. Note that it only removes spaces; it does not remove other characters such as leading zeroes. Let’s consider an example where we have a column containing product descriptions that were typed with stray spaces, such as “  Blue   cotton shirt ”. We can clean them up using this formula.

=TRIM(A2)
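As a rough Python counterpart to the cleaning functions above, the sketch below models TRIM and shows the built-in string methods that play the role of LOWER, UPPER, and PROPER. The `excel_trim` helper is hypothetical, and `str.title()` is only an approximation of PROPER.

```python
def excel_trim(text):
    """Model of TRIM: drop leading and trailing spaces and collapse
    runs of spaces between words into a single space."""
    return " ".join(part for part in text.split(" ") if part)


raw_description = "  Blue   cotton  shirt "
clean_description = excel_trim(raw_description)

print(repr(clean_description))
print(clean_description.lower())    # counterpart of LOWER
print(clean_description.upper())    # counterpart of UPPER
print("ada lovelace".title())       # approximate counterpart of PROPER
print(len(clean_description))       # counterpart of LEN
```

As with TRIM in Excel, only spaces are removed; digits such as leading zeroes are left untouched.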

 

IF() and ISNA()

> These functions are very useful in writing conditional statements. For example, you could use the IF function to return one result if a condition is true and another if it is false. Let’s consider an example where we want to report sales for a product category and month only when the data really belongs to that category and month. The formula for this is given below.

=IF(AND(A2:A4="CT";B2:B4="Jan");SUM(B9:B12);0)

Here, A2 through A4 hold the product category and B2 through B4 hold the month, and the formula is entered in cell E3. AND returns TRUE only if every cell in A2:A4 equals CT and every cell in B2:B4 equals Jan. Depending on your Excel version, a formula that compares whole ranges like this may need to be entered as an array formula.

If the condition is satisfied, the formula returns the sum of B9:B12; otherwise it returns 0. To show an empty cell instead of 0, use "" as the last argument. If you want to add up only the rows that match, rather than test all rows at once, SUMIFS is the better fit.

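You can also use Excel functions to find the #N/A error values in your dataset. ISNA returns TRUE if the cell contains #N/A, which is what lookup functions such as VLOOKUP return when they find no match, and FALSE otherwise.

=ISNA(B7)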

 

SUBSTITUTE()

> This function is very useful when you need to replace one piece of text with another inside a text string. Its arguments are the text to work on, the text to find, and the replacement text. For example, if the product codes in your sheet use hyphens and you need spaces instead, the formula below rewrites the code in B9.

=SUBSTITUTE(B9;"-";" ")

You can also use this function in conjunction with the IF statement, for example to leave a cell blank when the value consists of nothing but spaces, so that empty entries do not clutter a report or the data behind a chart. The formula for this is given below.

=IF(SUBSTITUTE(B9;" ";"")="";"";B9)

This returns the value in B9 only if it contains something other than spaces. Otherwise, it returns an empty string, which displays as a blank cell.

 

MINIFS/MAXIFS

> These functions return the minimum/maximum value among the cells that meet one or more conditions. They take the range to evaluate first, followed by pairs of criteria range and criterion, and are available in Excel 2019 and Microsoft 365. The formula for this is given below.

=MINIFS(C2:C6;D2:D6;"Jan")

Here, C2:C6 contains the values to compare and D2:D6 contains the month for each row, so the formula returns the smallest value in column C among the rows where column D equals Jan. The column layout and the Jan criterion are assumed for illustration.

=MAXIFS(B2:B6;C2:C6;"Marketing Expenses";D2:D6;"Jan")

Here, B2:B6 contains the values to compare, C2:C6 the category, and D2:D6 the month. The formula returns the largest value in column B among the rows that satisfy both conditions.
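A conditional minimum and maximum can be sketched in Python like this. The helper names and sample columns are hypothetical, and returning None when nothing matches is a choice made for this sketch rather than a statement about what Excel does.

```python
def minifs(values, criteria_range, criterion):
    """Smallest value whose paired criteria cell equals criterion.

    Returns None when no row matches (a choice made for this sketch).
    """
    matches = [v for v, c in zip(values, criteria_range) if c == criterion]
    return min(matches) if matches else None


def maxifs(values, criteria_range, criterion):
    """Largest value whose paired criteria cell equals criterion."""
    matches = [v for v, c in zip(values, criteria_range) if c == criterion]
    return max(matches) if matches else None


# Hypothetical columns: sales values and the month of each row.
sales_values = [120, 95, 340, 210, 180]
months = ["Jan", "Feb", "Jan", "Feb", "Jan"]

print(minifs(sales_values, months, "Jan"))
print(maxifs(sales_values, months, "Feb"))
```

Each helper first filters the values by the paired criteria column and only then takes the minimum or maximum.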

Generate predictions from your Data

 

How to plot a Scatter Plot?

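A scatter plot is a simple chart that plots the relationship between two variables. It’s often used to judge whether one variable can help predict another. Excel does not create charts from worksheet formulas: to draw a scatter plot with Y-axis = Sales and X-axis = Month, select the two columns and choose Insert > Scatter.

Formulas are still useful for preparing the values you plot. For example, the formula below multiplies the inputs in G2 and H2 by the coefficients stored in G4 and H4 and adds the results, which gives a simple linear prediction that can be plotted next to the actual values:

=G2*G$4+H2*H$4

You can add descriptive labels such as a “Sales vs. Month” chart title and axis titles to make the chart more meaningful. Select the chart and add them with the Chart Elements button.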

 

How to generate predictions using Auto Correlation Function?

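The autocorrelation function measures how strongly a series is correlated with earlier values of itself, which indicates whether previous values are useful for predicting the next ones. Excel has no dedicated autocorrelation function, but CORREL, which returns the correlation coefficient between two ranges, can be used for it:

=CORREL(B2:B14;B3:B15) (This correlates B2 to B14 with the same series shifted by one row, B3 to B15, which approximates the lag-1 autocorrelation of B2 to B15. A value close to 1 or -1 suggests that the previous value carries information about the next one. By contrast, =CORREL(B2:B15;C2:C15) measures the correlation between two different columns.)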
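To make the idea of correlating a series with a shifted copy of itself concrete, here is a small Python sketch. It computes a plain Pearson correlation between the series and its lagged version, which is one common approximation of autocorrelation; the function names and sample series are assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lag_autocorrelation(series, lag=1):
    """Correlate the series with itself shifted by `lag` positions."""
    if lag < 1 or len(series) - lag < 2:
        raise ValueError("series is too short for this lag")
    return pearson(series[:-lag], series[lag:])


trending = [1, 2, 3, 4, 5, 6]          # steadily rising series
alternating = [1, -1, 1, -1, 1, -1]    # flips sign every step

print(lag_autocorrelation(trending))
print(lag_autocorrelation(alternating))
```

A steadily rising series correlates strongly and positively with its own past, while a series that flips sign every step correlates negatively.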

 

How to generate predictions using ARIMA (Auto-Regressive Integrated Moving Average)?

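ARIMA is an advanced forecasting method that uses past values of a series to predict future ones. It combines three ideas: autoregression on earlier values (AR), differencing the series to remove trends (I), and a moving average of past forecast errors (MA). A model is described by three orders, for example (0,1,1) for no autoregressive terms, one round of differencing, and one moving-average term.

Excel has no built-in ARIMA function, so a formula such as =ARIMA(B2:B15,0,1,1) only works if an add-in defines it. ARIMA models are usually fitted in R or Python. For forecasting inside Excel itself, the closest built-in tool is exponential smoothing:

=FORECAST.ETS(A16;B2:B15;A2:A15) (This predicts the value for the date in A16 from the values in B2 to B15 and their dates in A2 to A15. It is available in Excel 2016 and later, and it is exponential smoothing, not ARIMA.)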
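To show what a forecast built from past values looks like in code, here is a sketch of simple exponential smoothing in Python. To be clear, this is not ARIMA: it is a much simpler relative, commonly described as corresponding to an ARIMA(0,1,1) model with a fixed moving-average coefficient. The function name, the smoothing factor, and the data are assumptions for the example.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast after smoothing the series.

    Each new observation pulls the running level towards itself by a
    fraction alpha (0 < alpha <= 1); the final level is the forecast.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


monthly_sales = [100, 110, 105, 120]  # hypothetical figures
forecast = simple_exponential_smoothing(monthly_sales, alpha=0.5)
print(forecast)
```

A larger alpha makes the forecast follow recent values more closely, while a smaller alpha smooths out short-term swings.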

 

How to create Charts in Data Science using Excel?

Creating charts in Excel is an easy task, but it is done from the ribbon rather than with a formula: Excel has no CHART worksheet function. In every case you select the data first and then pick a chart type on the Insert tab.

> To create a bar chart, select the category labels together with the values (for example, the product categories and B2:B6) and choose Insert > Column or Bar Chart. The labels become the category axis and the values set the bar lengths.

> You can also create scatter plots using Excel. Select two numeric columns, such as Month and Sales, and choose Insert > Scatter.

> You can also create box plots using Excel. In Excel 2016 and later, select the data and choose Insert > Insert Statistic Chart > Box and Whisker.

> You can also create a line chart with descriptive labels for the data series and axes. Select the data (for example, D2:E6), choose Insert > Line Chart, then add a chart title such as “Sales vs Month” and axis titles with the Chart Elements button.

 

How to generate predictions using SVM?

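SVM (Support Vector Machine) is a popular machine learning algorithm that’s useful for the classification of data. It learns a decision boundary from labelled training data and uses it to assign a class to new inputs.

Excel has no built-in SVM function, so a formula such as =SUPPORT VECTOR MACHINE(A2:B4,"Product Category") will not work unless an add-in defines it. In practice, the model is usually trained in Python or R, and only the predictions are brought back into Excel for reporting.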

 

How to generate predictions using KNN?

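K-Nearest Neighbours (KNN) is a popular method that predicts the label of a new data point from the labels of the k most similar points in the training data.

Excel has no built-in KNN function, so a formula such as =K Nearest Neighbor(A2:B15,"Product Category") will not work unless an add-in defines it. For small data sets the distances can be computed with ordinary formulas, but the method is far easier to run in Python or R.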
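As an illustration of how nearest-neighbour prediction works, here is a minimal pure-Python sketch. The function name, the choice of Euclidean distance, the value of k, and the training points are all assumptions made for the example; real projects would normally rely on an established machine learning library.

```python
from collections import Counter
from math import dist


def knn_predict(training, query, k=3):
    """Predict a label by majority vote among the k nearest points.

    `training` is a list of (features, label) pairs; distance is
    Euclidean.
    """
    neighbours = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (price score, quality score) -> category.
training = [
    ((1.0, 1.0), "Budget"),
    ((1.5, 2.0), "Budget"),
    ((2.0, 1.5), "Budget"),
    ((8.0, 8.0), "Premium"),
    ((8.5, 9.0), "Premium"),
    ((9.0, 8.5), "Premium"),
]

print(knn_predict(training, (2.0, 2.0)))
print(knn_predict(training, (8.0, 9.0)))
```

Each query point takes the label held by most of its three closest neighbours.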

 

How to sort a data set using Excel in Data Science?

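You can easily sort data by selecting it and choosing Data > Sort. If you want the sorted result to update automatically when the data changes, Microsoft 365 and Excel 2021 also offer the SORT function. Here’s a formula for this:

=SORT(B1:C9;2;-1) (This returns the rows of B1 to C9 sorted by the second column in descending order; use 1 instead of -1 for ascending order)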
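The same descending sort by one column looks like this in Python. The product names and figures are made up for the example.

```python
# Hypothetical rows: (product, units sold).
rows = [("Widget", 120), ("Gadget", 340), ("Gizmo", 95)]

# Sort by the second column, largest value first.
sorted_rows = sorted(rows, key=lambda row: row[1], reverse=True)

for product, units in sorted_rows:
    print(product, units)
```

Like the SORT function, `sorted` returns a new sorted copy and leaves the original rows untouched.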

 

How to create a Pareto chart using Excel?

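Creating Pareto charts is easy with Excel, although it is done with a chart type rather than a formula; there is no Pareto worksheet function. A Pareto chart shows the categories as bars sorted from largest to smallest, together with a line for the cumulative percentage.

In Excel 2016 and later, select the categories and their values (for example, the product categories and B2 to B11) and choose Insert > Insert Statistic Chart > Histogram > Pareto.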
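The numbers behind a Pareto chart are easy to compute by hand, as this Python sketch shows: sort the categories by count and keep a running cumulative percentage. The function name and the complaint counts are hypothetical.

```python
def pareto_table(counts):
    """Return (category, count, cumulative %) rows, largest count first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    running = 0
    table = []
    for category, count in ordered:
        running += count
        table.append((category, count, round(100 * running / total, 1)))
    return table


# Hypothetical complaint counts by cause.
complaints = {
    "Wrong item": 15,
    "Late delivery": 50,
    "Other": 5,
    "Damaged item": 30,
}

for row in pareto_table(complaints):
    print(row)
```

The cumulative column makes it easy to see how few categories account for most of the total.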

 

How to create Box-and-Whisker graph using Excel?

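You can easily create a box and whisker graph in Excel, again from the ribbon rather than with a formula; there is no BOXPLOT worksheet function. In Excel 2016 and later, select the data (for example, B2 to D9) and choose Insert > Insert Statistic Chart > Box and Whisker.

If you want the underlying numbers, the MIN, QUARTILE.INC, MEDIAN, and MAX functions return the values that a box plot summarises, for example =QUARTILE.INC(B2:B9;1) for the first quartile.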
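The values that a box and whisker graph summarises can be sketched in Python with the standard library. The function name and sample data are assumptions; the "inclusive" quantile method is used here because it is intended to mirror the interpolation of Excel's QUARTILE.INC, though you may want to verify that on your own data.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


sample = [1, 2, 3, 4, 5]  # hypothetical measurements
summary = five_number_summary(sample)
print(summary)
```

The box spans the first to the third quartile, the line inside it marks the median, and the whiskers extend towards the minimum and maximum.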

 

How to create Pivot Tables using Excel in Data Science?

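Pivot tables are useful for analysing data because they summarise a long list of records by one or more categories. They are created from the ribbon rather than with a formula; there is no PIVOT worksheet function.

Select the data (for example, A2 to B15, including a header row) and choose Insert > PivotTable. Then drag a field such as Product Category to Rows, Month to Columns, and Sales to Values to get total sales per category and month.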
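What a pivot table computes can be sketched in Python as a nested sum: one row per category, one column per month, and the total in each cell. The function and field names and the sample records are invented for the example.

```python
from collections import defaultdict


def pivot_sum(records, row_key, col_key, value_key):
    """Sum value_key for every (row_key, col_key) combination."""
    table = defaultdict(lambda: defaultdict(float))
    for record in records:
        table[record[row_key]][record[col_key]] += record[value_key]
    return {row: dict(cols) for row, cols in table.items()}


# Hypothetical sales records.
records = [
    {"category": "CT", "month": "Jan", "sales": 100.0},
    {"category": "CT", "month": "Jan", "sales": 200.0},
    {"category": "CT", "month": "Feb", "sales": 50.0},
    {"category": "NY", "month": "Jan", "sales": 75.0},
]

pivot = pivot_sum(records, "category", "month", "sales")
print(pivot)
```

Here the two January records for CT are merged into a single cell, just as a pivot table would do with Sum as the value setting.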

 

How to group results using Excel in Data Science?

You can use Excel to group results by assigning each row a value based on its category and filling the formula down a block of rows. Here’s a formula for this, entered in row 2 and copied down to row 11: =IF(A2="New York";B$1;IF(A2="San Francisco";C$1;IF(A2="Dallas";D$1;F$12)))

For each row, this returns the value from B1, C1, or D1 depending on the city in column A, and falls back to the value in F12 for any other city. Copying the formula down produces one result for each of A2 to A11.

> You can also summarise data using weighted means. Excel has no WEIGHTED MEAN function, but SUMPRODUCT does the job: =SUMPRODUCT(C$1:C$9;D$1:D$9)/SUM(D$1:D$9)

This returns the mean of the values in C1 to C9, weighted by D1 to D9. Which column holds the values and which the weights is an assumption here.
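Both ideas, mapping a category to a value and computing a weighted mean, can be sketched in Python as follows. The city values, the default, and the sample numbers are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# A dictionary lookup plays the role of the nested IF formula.
value_by_city = {"New York": 1.2, "San Francisco": 1.5, "Dallas": 0.9}
default_value = 1.0  # used for any city not listed above

cities = ["Dallas", "New York", "Boston"]
assigned = [value_by_city.get(city, default_value) for city in cities]

print(assigned)
print(weighted_mean([10, 20, 30], [1, 1, 2]))
```

A lookup table scales better than nested conditions: adding a city means adding one entry rather than another level of IF.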

 

Which formula is better for Data Science?

As you can see, Excel provides a lot of tools to use in data science, and hence, it’s not possible to decide which one is good and which one is bad.

However, we would like to add that formulas have their own limitations: a workbook lives on a single machine, and a worksheet holds at most 1,048,576 rows.

When your data outgrows that, or the computation has to be spread across several machines, consider DataFrame libraries such as pandas, or Spark, which helps us do distributed processing.

 

Takeaway

These are some of the most common formulas used in Excel for Data Science. 

Excel is arguably one of the best tools ever made, and it has remained the gold standard for nearly all businesses worldwide. But whether you’ve been working in your industry for 5 years or 15, there are always more skills to learn.

Want us to cover any specific topics around Data Science? Let us know!

 
