# Category Archives: Python Data Analysis

Data Analysis with Python

# Statistics with Pandas DataFrames

Below are the Pandas DataFrame statistical methods, each with a description and an example:

`describe`: This method returns a small table with descriptive statistics.
``````
print("Describe", sunspots.describe(), "\n")

#out
Describe        Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
count                        317.000000                      199.000000
mean                          79.378233                        8.009548
std                           61.998961                        3.808942
min                            0.000000                        1.700000
25%                           25.100000                        4.700000
50%                           66.700000                        7.700000
75%                          116.300000                       10.450000
max                          269.300000                       19.100000

       Number of Observations  Definitive/Provisional Indicator
count              199.000000                             317.0
mean              1467.678392                               1.0
std               2463.925956                               0.0
min                150.000000                               1.0
25%                365.000000                               1.0
50%                365.000000                               1.0
75%                366.000000                               1.0
max               9940.000000                               1.0
``````
`count`: This method returns the number of non-NaN items.
``````
print("Non NaN observations", sunspots.count(), "\n")

#out
Non NaN observations Yearly Mean Total Sunspot Number    317
Yearly Mean Standard Deviation      199
Number of Observations              199
Definitive/Provisional Indicator    317
dtype: int64
``````
`mad`: This method calculates the mean absolute deviation, a measure of dispersion that is less sensitive to outliers than the standard deviation. Note that `DataFrame.mad()` was deprecated in pandas 1.5 and removed in pandas 2.0, so this call only works on older versions.
``````
print("MAD", sunspots.mad(), "\n")

#out
MAD Yearly Mean Total Sunspot Number      50.931061
Yearly Mean Standard Deviation         3.128860
Number of Observations              1838.628924
Definitive/Provisional Indicator       0.000000
dtype: float64
``````
`median`: This method returns the median. This is equivalent to the value at the 50th percentile.
``````
print("Median", sunspots.median(), "\n")

#out
Median Yearly Mean Total Sunspot Number     66.7
Yearly Mean Standard Deviation        7.7
Number of Observations              365.0
Definitive/Provisional Indicator      1.0
dtype: float64
``````
`min`: This method returns the lowest value.
``````
print("Min", sunspots.min(), "\n")

#out
Min Yearly Mean Total Sunspot Number      0.0
Yearly Mean Standard Deviation        1.7
Number of Observations              150.0
Definitive/Provisional Indicator      1.0
dtype: float64
``````
`max`: This method returns the highest value.
``````
print("Max", sunspots.max(), "\n")

#out
Max Yearly Mean Total Sunspot Number     269.3
Yearly Mean Standard Deviation        19.1
Number of Observations              9940.0
Definitive/Provisional Indicator       1.0
dtype: float64
``````
`mode`: This method returns the mode, which is the most frequently occurring value.
``````
print("Mode", sunspots.mode(), "\n")

#out
Mode    Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
0                              18.3                             9.2

   Number of Observations  Definitive/Provisional Indicator
0                   365.0                               1.0
``````
`std`: This method returns the standard deviation, which measures dispersion. It is the square root of the variance.
``````
print("Standard Deviation", sunspots.std(), "\n")

#out
Standard Deviation Yearly Mean Total Sunspot Number      61.998961
Yearly Mean Standard Deviation         3.808942
Number of Observations              2463.925956
Definitive/Provisional Indicator       0.000000
dtype: float64
``````
`var`: This method returns the variance.
``````
print("Variance", sunspots.var(), "\n")

#out
Variance Yearly Mean Total Sunspot Number    3.843871e+03
Yearly Mean Standard Deviation      1.450804e+01
Number of Observations              6.070931e+06
Definitive/Provisional Indicator    0.000000e+00
dtype: float64
``````
`skew`: This method returns the skewness, which measures how asymmetric the distribution is. A symmetric distribution has a skewness of 0.
``````
print("Skewness", sunspots.skew(), "\n")

#out
Skewness Yearly Mean Total Sunspot Number    0.804414
Yearly Mean Standard Deviation      0.561530
Number of Observations              1.871952
Definitive/Provisional Indicator    0.000000
dtype: float64
``````
`kurt`: This method returns the kurtosis, which describes how heavy the tails of the distribution are. Pandas uses Fisher's definition, so a normal distribution has a kurtosis of 0.
``````
print("Kurtosis", sunspots.kurt(), "\n")

#out
Kurtosis Yearly Mean Total Sunspot Number   -0.135040
Yearly Mean Standard Deviation     -0.243958
Number of Observations              1.850223
Definitive/Provisional Indicator    0.000000
dtype: float64
``````
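
The relationships mentioned in the descriptions can be checked on a small table. The sketch below uses a made-up DataFrame (the column names `a` and `b` and all values are assumptions, not the sunspot data). Because `mad()` is not available in recent pandas releases, the mean absolute deviation is computed by hand here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two numeric columns, one with a missing value.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 10.0],
    "b": [2.0, np.nan, 4.0, 4.0, 6.0],
})

# count() skips NaN, so column "b" has one observation fewer than "a".
counts = df.count()

# The median is the value at the 50th percentile.
medians = df.median()
halves = df.quantile(0.5)

# The standard deviation is the square root of the variance.
std_from_var = np.sqrt(df.var())

# Mean absolute deviation computed by hand, as a stand-in for mad().
mad = (df - df.mean()).abs().mean()

print(counts, medians, std_from_var, mad, sep="\n\n")
```

NaN values are skipped by all of these reductions by default, which is why `count` is a useful first check before comparing columns.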

# Querying Data in Pandas

A Pandas DataFrame is structured much like a relational database table, so we can view operations that read data from a DataFrame as queries. We are going to retrieve the annual sunspot data from Quandl.

We can use the Quandl API, or we can download the data manually as a CSV file from the Quandl website. To download the data you have to sign up. I will install the API by running the following command in the terminal:

``pip3 install Quandl``

Note:
The API is limited to 50 calls per day.
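
The idea of treating reads as queries can be tried without a Quandl account. The sketch below builds a tiny stand-in table by hand; the name `sunspots`, the column name, and all values are made up for illustration and are not the real Quandl data.

```python
import pandas as pd

# Hypothetical stand-in for the sunspot table: a year index and one column.
sunspots = pd.DataFrame(
    {"Sunspot Number": [5.0, 11.0, 16.0, 23.0, 36.0, 58.0]},
    index=pd.Index([1700, 1701, 1702, 1703, 1704, 1705], name="Year"),
)

# First and last rows, similar to a SELECT with a LIMIT in SQL.
first_two = sunspots.head(2)
last_two = sunspots.tail(2)

# Look up a single row by its index label.
row_1702 = sunspots.loc[1702]

# Filter rows with a Boolean condition, similar to a WHERE clause.
column = sunspots["Sunspot Number"]
above_mean = sunspots[column > column.mean()]

print(first_two, last_two, row_1702, above_mean, sep="\n\n")
```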

# Pandas Series

The Pandas `Series` is a one-dimensional labeled array that can hold data of any type. We can create a Pandas `Series` data structure in three different ways:

• With a Python dictionary: the dictionary keys become the index (older Pandas versions sorted the keys; current versions keep their insertion order)
• With a NumPy array: by default, the index values are integers counting up from 0
• With a single scalar value: the value is repeated for each label of the index we supply

Note:
When creating a Series, we can hand the constructor a list of axis labels (the index). The index is an optional parameter.
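
The three ways of creating a `Series` can be sketched as follows. All variable names and values are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"b": 2, "a": 1, "c": 3})

# From a NumPy array: a default integer index 0, 1, 2, ... is generated.
from_array = pd.Series(np.array([10.0, 20.0, 30.0]))

# From a scalar: the value is repeated once per supplied index label.
from_scalar = pd.Series(7, index=["x", "y", "z"])

# An explicit index can also be passed together with an array.
labelled = pd.Series(np.array([10.0, 20.0]), index=["first", "second"])

print(from_dict, from_array, from_scalar, labelled, sep="\n\n")
```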

# The Pandas DataFrames

A Pandas DataFrame is a labeled two-dimensional data structure similar to a worksheet or a relational database table. The columns of a Pandas DataFrame can have different types.

A DataFrame can be created in the following ways:

• Using another DataFrame
• Using a NumPy array or a composite of arrays that has a two-dimensional shape
• From another Pandas data structure called `Series`
• From a file (a CSV file for example)
• From a dictionary of one-dimensional structures (one-dimensional NumPy arrays, lists, dicts, or Pandas Series, for example)
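
A minimal sketch of these construction routes is shown below. The column names and values are made up, and `io.StringIO` stands in for a real CSV file on disk.

```python
import io

import numpy as np
import pandas as pd

# From a dictionary of one-dimensional structures (lists here).
from_dict = pd.DataFrame({"year": [2001, 2002], "value": [1.5, 2.5]})

# From a two-dimensional NumPy array; the column names are supplied by us.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# From a Series; the Series name becomes the column name.
from_series = pd.DataFrame(pd.Series([1, 2, 3], name="s"))

# From another DataFrame (copy=True avoids sharing data with the source).
from_frame = pd.DataFrame(from_dict, copy=True)

# From CSV text; io.StringIO stands in for a real file.
csv_text = "year,value\n2001,1.5\n2002,2.5\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Columns of one DataFrame can have different types.
print(from_dict.dtypes)
```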

# Pandas introduction

Pandas is a very popular Python library. Its name comes from "panel data" and "Python data analysis".

The official documentation writes the name in lowercase, `pandas`, and the conventional import statement is `import pandas as pd`.

# Broadcasting NumPy arrays

Broadcasting is the mechanism NumPy uses to carry out an element-wise operation on two operands that do not have the same shape, as long as the shapes are compatible.

In this example we multiply an array by a scalar: we download a sound file and make it quieter.

``````
import urllib.request

import matplotlib.pyplot as plt
import numpy as np
import scipy.io.wavfile as sw

# Download the sample WAV file and print the HTTP response headers.
request = urllib.request.Request('http://www.thesoundarchive.com/austinpowers/smashingbaby.wav')
response = urllib.request.urlopen(request)
print(response.info())

# Save the downloaded bytes to disk.
WAV_FILE = 'smashingbaby.wav'
with open(WAV_FILE, 'wb') as filehandle:
    filehandle.write(response.read())

# Read the samples back as a NumPy array.
sample_rate, data = sw.read(WAV_FILE)
print("Data type", data.dtype, "Shape", data.shape)

plt.subplot(2, 1, 1)
plt.title("Original")
plt.plot(data)

# Broadcasting: the scalar 0.2 is applied to every sample.
newdata = data * 0.2
newdata = newdata.astype(np.uint8)
print("Data type", newdata.dtype, "Shape", newdata.shape)

sw.write("quiet.wav", sample_rate, newdata)

plt.subplot(2, 1, 2)
plt.title("Quiet")
plt.plot(newdata)

plt.show()
``````

The output:

``````Content-Type: audio/wav
Last-Modified: Tue, 12 Aug 2014 15:53:38 GMT
Accept-Ranges: bytes
ETag: "ac20ba9445b6cf1:0"
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Tue, 26 Sep 2017 22:32:35 GMT
Connection: close
Content-Length: 43642

Data type uint8 Shape (43584,)
Data type uint8 Shape (43584,)  ``````
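
The same broadcasting rule can be tried without downloading anything. In the sketch below, the small `data` array is a made-up stand-in for the audio samples. The second half shows that broadcasting also works between two arrays of compatible shapes, not only between an array and a scalar.

```python
import numpy as np

# A stand-in for 8-bit audio samples (hypothetical values).
data = np.array([0, 50, 100, 200, 255], dtype=np.uint8)

# Scalar broadcasting: 0.2 is applied to every element.
# The result is float64 because of the floating-point scalar.
scaled = data * 0.2

# Casting back to uint8 drops the fractional part.
quiet = scaled.astype(np.uint8)

# Broadcasting between arrays: a (3, 1) column and a (4,) row
# combine into a (3, 4) multiplication table.
column = np.arange(3).reshape(3, 1)
row = np.arange(4)
table = column * row

print(scaled.dtype, quiet, table, sep="\n")
```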

# Indexing NumPy arrays with Booleans

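Boolean indexing is a kind of fancy indexing, so it works in much the same way: instead of integer positions, we pass an array of Booleans, and the elements where it is `True` are selected.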

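``````
import scipy.misc
import matplotlib.pyplot as plt
import numpy as np

# Newer SciPy releases provide this image as scipy.datasets.face().
face = scipy.misc.face()
xmax = face.shape[0]
ymax = face.shape[1]
# Crop to a square so both Boolean masks select the same number of positions.
face = face[:min(xmax, ymax), :min(xmax, ymax)]

def get_indices(size):
    # True at every fourth position: 0, 4, 8, ...
    arr = np.arange(size)
    return arr % 4 == 0

face1 = face.copy()
xindices = get_indices(face.shape[0])
yindices = get_indices(face.shape[1])
# Set the selected points on the diagonal to black.
face1[xindices, yindices] = 0
plt.subplot(211)
plt.imshow(face1)
face2 = face.copy()
# Black out values between a quarter and three quarters of the maximum.
face2[(face > face.max()/4) & (face < 3 * face.max()/4)] = 0
plt.subplot(212)
plt.imshow(face2)
plt.show()
``````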
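
The effect of a Boolean mask is easier to follow on a small array than on an image. A minimal sketch with made-up values:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# A Boolean mask with the same shape as the array.
mask = arr % 4 == 0

# Selecting with the mask returns a flat array of the matching elements.
selected = arr[mask]

# Assigning through a mask changes only the matching elements.
masked = arr.copy()
masked[(arr > 2) & (arr < 9)] = 0

print(selected)
print(masked)
```

Note that the two conditions are combined with `&` and each one is wrapped in parentheses, as in the image example above.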

# Indexing with a list of locations

We are again going to use the raccoon face image that ships with SciPy and apply the `ix_()` function to shuffle it.

``````
import scipy.misc
import matplotlib.pyplot as plt
import numpy as np

face = scipy.misc.face()
xmax = face.shape[0]
ymax = face.shape[1]

def shuffle_indices(size):
    # Return the integers 0..size-1 in random order.
    arr = np.arange(size)
    np.random.shuffle(arr)

    return arr

xindices = shuffle_indices(xmax)
np.testing.assert_equal(len(xindices), xmax)
yindices = shuffle_indices(ymax)
np.testing.assert_equal(len(yindices), ymax)
# ix_ pairs every shuffled row with every shuffled column.
plt.imshow(face[np.ix_(xindices, yindices)])
plt.show()
``````
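
What `ix_()` does is easier to see on a small array. The sketch below, with made-up index lists, contrasts it with passing the two lists directly.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# Pick rows 2 and 0 (in that order) and columns 3 and 1.
rows = [2, 0]
cols = [3, 1]

# np.ix_ builds an open mesh, so every row is paired with every column.
block = arr[np.ix_(rows, cols)]

# Without ix_, the two lists are paired element by element instead.
pairs = arr[rows, cols]

print(block)
print(pairs)
```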

# Fancy indexing

Fancy indexing is indexing that goes beyond single integers and slices: we index with lists or arrays of positions instead. Taking the raccoon face image again, we can draw two black diagonal lines on it by setting the diagonal values to 0.

``````
import scipy.misc
import matplotlib.pyplot as plt

face = scipy.misc.face()

xmax = face.shape[0]
ymax = face.shape[1]

# Crop to a square and copy, so the assignments below write to our own array.
face = face[:min(xmax, ymax), :min(xmax, ymax)].copy()

xmax = face.shape[0]
ymax = face.shape[1]

# Main diagonal, then the anti-diagonal.
face[range(xmax), range(ymax)] = 0
face[range(xmax-1, -1, -1), range(ymax)] = 0

plt.imshow(face)
plt.show()
``````
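
The same diagonal trick can be checked on a small grid, where the result is easy to print. The size `n` and the array name `grid` are arbitrary choices for this sketch.

```python
import numpy as np

n = 5
grid = np.ones((n, n), dtype=int)

# Pair row i with column i: the main diagonal.
grid[range(n), range(n)] = 0

# Pair row n-1-i with column i: the anti-diagonal.
grid[range(n - 1, -1, -1), range(n)] = 0

print(grid)
```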

# Creating array views and copies

You might think that a view only lets you look at the data and nothing more, as in the `ravel()` example on an array. In NumPy, views are NOT read-only!

Because a view shares its data with the original array, making an independent copy is important whenever you need data that will not change underneath you. A slice of an array produces a view; if you assign the slice to a variable and then alter the underlying array, the value of the variable changes as well.
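
A minimal sketch of the difference between a view and a copy, using a small made-up array:

```python
import numpy as np

original = np.arange(6)

# Slicing returns a view: it shares memory with the original array.
view = original[:3]

# copy() returns an independent array with its own memory.
replica = original[:3].copy()

# Changing the original shows up in the view but not in the copy.
original[0] = 99

# Writing through the view also changes the original.
view[1] = -1

print(original, view, replica, sep="\n")
```

`np.shares_memory(view, original)` is one way to check whether two arrays are backed by the same data.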

Example: