Mean and Covariance of Multivariate data with Python

Links to the data files — file.io/9KK9gi3MXzpj

Questions

Question 1 — Using the Excel file dataA.xlsx, which contains a 500x3 data matrix (500 data points with 3 attributes), calculate both the mean and the covariance matrix.

Question 2 — Using the Excel file dataB.xlsx, which contains a 500x10 data matrix (500 data points with 10 attributes), calculate both the mean and the covariance matrix.

Question 3 — The data was randomly generated from a normal distribution whose mean and covariance are given in meanA.xlsx and covarianceA.xlsx for dataA, and in meanB.xlsx and covarianceB.xlsx for dataB. Briefly explain why your answers differ from the parameters used to generate the data.
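Question 3 rests on the difference between the parameters of a distribution and the estimates computed from a finite sample drawn from it. The sketch below illustrates that difference under stated assumptions: `true_mean` and `true_cov` are made-up values (not the contents of meanA.xlsx or covarianceA.xlsx), and `estimation_errors` is a hypothetical helper written only for this example.

```python
import numpy as np

# Hypothetical parameters, chosen for illustration only; they are NOT
# the values stored in meanA.xlsx or covarianceA.xlsx.
true_mean = np.array([0.5, 1.0, 0.8])
true_cov = np.array([[4.0, 0.0, 0.0],
                     [0.0, 2.5, 0.0],
                     [0.0, 0.0, 3.0]])

rng = np.random.default_rng(seed=0)  # fixed seed so runs are repeatable


def estimation_errors(n):
    """Draw n points and return the largest absolute error of each estimate."""
    sample = rng.multivariate_normal(true_mean, true_cov, size=n)
    mean_err = np.abs(sample.mean(axis=0) - true_mean).max()
    cov_err = np.abs(np.cov(sample, rowvar=False) - true_cov).max()
    return mean_err, cov_err


# Compare a sample of the same size as the data files with a much larger one
errors = {n: estimation_errors(n) for n in (500, 500_000)}
for n, (mean_err, cov_err) in errors.items():
    print(f"n={n:>7}: max mean error={mean_err:.4f}, max cov error={cov_err:.4f}")
```

With 500 points the estimates should land near the generating parameters without matching them, and the gap should typically narrow as the sample grows, although it never reaches exactly zero for a finite sample.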

Implementation

Below is the Python code for calculating the mean and the covariance matrix:

import pandas
import numpy
import matplotlib.pyplot

# Read dataA.xlsx and dataB.xlsx (the files have no header row)
dataA = pandas.read_excel("dataA.xlsx", header=None)
dataB = pandas.read_excel("dataB.xlsx", header=None)

# Convert the DataFrames to NumPy arrays
datanpyA = dataA.to_numpy()
datanpyB = dataB.to_numpy()

# Scatter-plot the first two attributes of each data set (A in red, B in blue)
matplotlib.pyplot.figure()
matplotlib.pyplot.scatter(datanpyA[:,0], datanpyA[:,1], c = 'r', marker = '.')
matplotlib.pyplot.scatter(datanpyB[:,0], datanpyB[:,1], c = 'b', marker = '.')

# Calculate the mean
meanA = numpy.mean(datanpyA,axis = 0)
meanB = numpy.mean(datanpyB,axis = 0)

# Subtract mean from the data
datawithoutmeanA = datanpyA - meanA
datawithoutmeanB = datanpyB - meanB

# Calculate the sample covariance: C = Xc^T . Xc / (n - 1), where Xc is the mean-centered data
covA = numpy.dot(numpy.transpose(datawithoutmeanA), datawithoutmeanA)/(len(datawithoutmeanA) - 1)
covB = numpy.dot(numpy.transpose(datawithoutmeanB), datawithoutmeanB)/(len(datawithoutmeanB) - 1)


# Print the means and covariance matrices (scientific notation suppressed)
numpy.set_printoptions(suppress=True)
print("Question 1 Solution:")
print("Mean A =>\n", meanA)
print("Covariance A =>\n", covA)
print("\n------------------------------\n")
print("Question 2 Solution:")
print("Mean B =>\n", meanB)
print("Covariance B =>\n", covB)
print("\n------------------------------\n")
print("Question 3 Solution")
print("The values in the mean and covariance Excel files are the population parameters used to generate the data, whereas the values calculated in Python are the mean and covariance of one finite sample (the dataA and dataB Excel files). \nSince the data is randomly generated from a multivariate normal distribution using meanA and covarianceA for dataA, and meanB and covarianceB for dataB, \nthe mean and covariance computed from the generated data will not exactly equal the generating mean and covariance matrix, but will be close to them.")

matplotlib.pyplot.show()



#-------------- OUTPUT ----------------#
# Question 1 Solution:
# Mean A =>
#  [0.34750193 1.02563712 0.80122132]
# Covariance A =>
#  [[4.0704887  0.1502016  0.26208365]
#  [0.1502016  2.56307135 0.01468606]
#  [0.26208365 0.01468606 3.18321243]]

# ------------------------------

# Question 2 Solution:
# Mean B =>
#  [9.57062029 6.15014874 8.08016477 9.55989208 8.8040749  2.19491256
#  0.20634971 4.54942571 0.06659806 4.65575632]
# Covariance B =>
#  [[ 9.57410499  0.15742552  0.69100599 -0.04315714 -0.15529541  1.12934141
#    0.02644636 -0.48654602  0.95636371  0.53327821]
#  [ 0.15742552  9.51640579  0.47757067  0.41333501  0.00557376  0.53194456
#    0.11100153  0.17033133  0.84105524  1.33915044]
#  [ 0.69100599  0.47757067  8.65741988 -0.31162145  0.16556618  0.19225256
#    0.18585505  0.4101727   0.22889477 -0.15427328]
#  [-0.04315714  0.41333501 -0.31162145 10.27052739  0.2510052   0.34881198
#    0.68992571  0.32255801  0.72253427  1.0499889 ]
#  [-0.15529541  0.00557376  0.16556618  0.2510052   9.65117562  0.65088712
#    0.15264545 -0.16605455  1.35788702 -0.19805019]
#  [ 1.12934141  0.53194456  0.19225256  0.34881198  0.65088712 10.91504476
#    0.80109036  0.33946519  0.09688857  1.34008328]
#  [ 0.02644636  0.11100153  0.18585505  0.68992571  0.15264545  0.80109036
#    9.26492074 -0.19919067 -0.21481801  0.85962642]
#  [-0.48654602  0.17033133  0.4101727   0.32255801 -0.16605455  0.33946519
#   -0.19919067  9.1616525   0.29534256  0.13637128]
#  [ 0.95636371  0.84105524  0.22889477  0.72253427  1.35788702  0.09688857
#   -0.21481801  0.29534256 10.14780653  1.96290271]
#  [ 0.53327821  1.33915044 -0.15427328  1.0499889  -0.19805019  1.34008328
#    0.85962642  0.13637128  1.96290271 10.60596324]]

# ------------------------------

# Question 3 Solution
# The values in the mean and covariance Excel files are the population parameters used to generate the data, whereas the values calculated in Python are the mean and covariance of one finite sample (the dataA and dataB Excel files). 
# Since the data is randomly generated from a multivariate normal distribution using meanA and covarianceA for dataA, and meanB and covarianceB for dataB, 
# the mean and covariance computed from the generated data will not exactly equal the generating mean and covariance matrix, but will be close to them.
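As a sanity check, the manual covariance formula used in the script can be compared with NumPy's built-in `numpy.cov`. The sketch below is a minimal example that assumes a synthetic 500x3 array as a stand-in for dataA.xlsx, so it runs without the spreadsheet; the names `data`, `cov_manual` and `cov_builtin` are introduced here for illustration only.

```python
import numpy as np

# Synthetic stand-in for dataA.xlsx (500 rows, 3 columns); the values
# are arbitrary and are not the data from the article.
rng = np.random.default_rng(seed=1)
data = rng.normal(loc=1.0, scale=2.0, size=(500, 3))

# Manual estimate, following the formula in the script: C = Xc^T . Xc / (n - 1)
centered = data - data.mean(axis=0)
cov_manual = centered.T @ centered / (len(data) - 1)

# Built-in estimate; rowvar=False tells NumPy that each column is a variable
cov_builtin = np.cov(data, rowvar=False)

print("Shapes:", cov_manual.shape, cov_builtin.shape)
print("Estimates agree:", np.allclose(cov_manual, cov_builtin))
```

By default `numpy.cov` also divides by n - 1, so the two results should agree up to floating-point rounding. The `rowvar=False` argument matters because `numpy.cov` otherwise treats each row, not each column, as a variable.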
