Category Archives: Python Data Analysis

Data Analysis with Python

Statistics with Pandas DataFrames

Below are the Pandas DataFrame statistical methods, each with a description and an example:

describe: This method returns a small table with descriptive statistics.
print("Describe", sunspots.describe(),"\n")

#out
Describe        Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
count                        317.000000                      199.000000   
mean                          79.378233                        8.009548   
std                           61.998961                        3.808942   
min                            0.000000                        1.700000   
25%                           25.100000                        4.700000   
50%                           66.700000                        7.700000   
75%                          116.300000                       10.450000   
max                          269.300000                       19.100000   

       Number of Observations  Definitive/Provisional Indicator  
count              199.000000                             317.0  
mean              1467.678392                               1.0  
std               2463.925956                               0.0  
min                150.000000                               1.0  
25%                365.000000                               1.0  
50%                365.000000                               1.0  
75%                366.000000                               1.0  
max               9940.000000                               1.0
count: This method returns the number of non-NaN items.
print("Non NaN observations", sunspots.count(),"\n")

#out
Non NaN observations Yearly Mean Total Sunspot Number    317
Yearly Mean Standard Deviation      199
Number of Observations              199
Definitive/Provisional Indicator    317
dtype: int64
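mad: This method calculates the mean absolute deviation, a measure of dispersion that is less sensitive to outliers than the standard deviation. Note that DataFrame.mad() was deprecated in pandas 1.5 and removed in pandas 2.0.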
print("MAD", sunspots.mad(),"\n")

#out
MAD Yearly Mean Total Sunspot Number      50.931061
Yearly Mean Standard Deviation         3.128860
Number of Observations              1838.628924
Definitive/Provisional Indicator       0.000000
dtype: float64
median: This method returns the median, which is the value at the 50th percentile.
print("Median", sunspots.median(),"\n")

#out
Median Yearly Mean Total Sunspot Number     66.7
Yearly Mean Standard Deviation        7.7
Number of Observations              365.0
Definitive/Provisional Indicator      1.0
dtype: float64
min: This method returns the lowest value.
print("Min", sunspots.min(),"\n")

#out
Min Yearly Mean Total Sunspot Number      0.0
Yearly Mean Standard Deviation        1.7
Number of Observations              150.0
Definitive/Provisional Indicator      1.0
dtype: float64
max: This method returns the highest value.
print("Max", sunspots.max(),"\n")

#out
Max Yearly Mean Total Sunspot Number     269.3
Yearly Mean Standard Deviation        19.1
Number of Observations              9940.0
Definitive/Provisional Indicator       1.0
dtype: float64
mode: This method returns the mode, which is the most frequently occurring value.
print("Mode", sunspots.mode(),"\n")

#out
Mode    Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
0                              18.3                             9.2   

   Number of Observations  Definitive/Provisional Indicator  
0                   365.0                               1.0
std: This method returns the standard deviation, which measures dispersion. It is the square root of the variance.
print("Standard Deviation", sunspots.std(),"\n")

#out
Standard Deviation Yearly Mean Total Sunspot Number      61.998961
Yearly Mean Standard Deviation         3.808942
Number of Observations              2463.925956
Definitive/Provisional Indicator       0.000000
dtype: float64
var: This method returns the variance.
print("Variance", sunspots.var(),"\n")

#out
Variance Yearly Mean Total Sunspot Number    3.843871e+03
Yearly Mean Standard Deviation      1.450804e+01
Number of Observations              6.070931e+06
Definitive/Provisional Indicator    0.000000e+00
dtype: float64
skew: This method returns the skewness, which indicates how asymmetric the distribution is.
print("Skewness", sunspots.skew(),"\n")

#out
Skewness Yearly Mean Total Sunspot Number    0.804414
Yearly Mean Standard Deviation      0.561530
Number of Observations              1.871952
Definitive/Provisional Indicator    0.000000
dtype: float64
kurt: This method returns the kurtosis, which describes the shape of the distribution, in particular how heavy its tails are.
print("Kurtosis", sunspots.kurt(),"\n")

#out
Kurtosis Yearly Mean Total Sunspot Number   -0.135040
Yearly Mean Standard Deviation     -0.243958
Number of Observations              1.850223
Definitive/Provisional Indicator    0.000000
dtype: float64

Querying Data in Pandas

A Pandas DataFrame is structured much like a table in a relational database, so we can think of operations that read data from a DataFrame as queries. We are going to retrieve the annual sunspot data from Quandl.

We can use the Quandl API, or we can download the data manually as a CSV file from the Quandl website. To download the data you have to sign up. I will install the API by running the following command in the terminal:

pip3 install Quandl

Note:
The API is limited to 50 calls per day.
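
As a rough illustration of treating DataFrame reads as queries, here is a minimal sketch on a small made-up table (df, its values, and the column name sunspots are hypothetical; no Quandl download is involved). The SQL analogies in the comments are loose:

```python
import pandas as pd

# Hypothetical yearly data standing in for the sunspot table.
df = pd.DataFrame(
    {"sunspots": [8.3, 18.3, 26.7, 38.3, 60.0]},
    index=pd.Index([1700, 1701, 1702, 1703, 1704], name="year"),
)

first_two = df.head(2)                  # roughly SELECT ... LIMIT 2
last_value = df.loc[1704, "sunspots"]   # look up one cell by its labels
above_mean = df[df["sunspots"] > df["sunspots"].mean()]  # roughly a WHERE clause

print(first_two)
print(last_value)
print(above_mean)
```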

Pandas Series

The Pandas Series is a one-dimensional, heterogeneous array with labels. We can create a Pandas Series data structure in three different ways:

  • With a Python dictionary – the dictionary keys become the index (sorted in older pandas versions; recent versions keep the insertion order)
  • With a NumPy array – the index defaults to integers starting from 0
  • With a single scalar value – we have to supply the index, and the value is repeated for each label

Note:
When creating a Series, we can hand the constructor a list of axis labels (index). The index is an optional parameter.
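
The three ways of creating a Series can be sketched as follows (the variable names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a NumPy array: the index defaults to 0, 1, 2, ...
from_array = pd.Series(np.array([10, 20, 30]))

# From a scalar: the value is repeated once per supplied index label.
from_scalar = pd.Series(7, index=["x", "y", "z"])

print(from_dict)
print(from_array)
print(from_scalar)
```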

The Pandas DataFrames

A Pandas DataFrame is a labeled two-dimensional data structure similar to a worksheet or a relational database table. The columns of a Pandas DataFrame can have different types.

A DataFrame can be created in the following ways:

  • Using another DataFrame
  • Using a NumPy array or a composite of arrays that has a two-dimensional shape
  • From another Pandas data structure called Series
  • From a file (a CSV file for example)
  • From a dictionary of one-dimensional structures (one-dimensional NumPy arrays, lists, dicts, or Pandas Series, for example)
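
The options listed above can be sketched as follows. All names and values are made up, and io.StringIO stands in for a CSV file on disk so that the example is self-contained:

```python
import io

import numpy as np
import pandas as pd

# From a dictionary of one-dimensional structures.
from_dict = pd.DataFrame({"a": [1, 2, 3], "b": np.array([4.0, 5.0, 6.0])})

# From a two-dimensional NumPy array, with column names supplied.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# From a Series: it becomes a single column named after the Series.
from_series = pd.Series([1, 2, 3], name="s").to_frame()

# From another DataFrame (copy=True asks for independent data).
from_frame = pd.DataFrame(from_dict, copy=True)

# From CSV text; pass a file path instead to read a real file.
from_csv = pd.read_csv(io.StringIO("year,sunspots\n1700,8.3\n1701,18.3\n"))

print(from_dict.dtypes)  # the two columns have different types
print(from_csv)
```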

Broadcasting a NumPy array

Broadcasting lets NumPy carry out an operation on two operands even when they don’t have the same shape: the smaller operand is treated as if it were stretched to match the larger one.

In this example we are going to multiply a scalar and an array. We will import a sound file and we will make it quieter.

import urllib.request  # the request submodule has to be imported explicitly

import matplotlib.pyplot as plt
import numpy as np
import scipy.io.wavfile as sw

# Download the sound file and save it locally
request = urllib.request.Request('http://www.thesoundarchive.com/austinpowers/smashingbaby.wav')
response = urllib.request.urlopen(request)
print(response.info())
WAV_FILE = 'smashingbaby.wav'
with open(WAV_FILE, 'wb') as filehandle:
    filehandle.write(response.read())

sample_rate, data = sw.read(WAV_FILE)
print("Data type", data.dtype, "Shape", data.shape)

plt.subplot(2, 1, 1)
plt.title("Original")
plt.plot(data)

plt.show()

# Broadcasting: the scalar 0.2 is applied to every sample in the array
newdata = data * 0.2
# The product is a float array; convert it back to 8-bit integers
newdata = newdata.astype(np.uint8)
print("Data type", newdata.dtype, "Shape", newdata.shape)

sw.write("quiet.wav", sample_rate, newdata)

plt.subplot(2, 1, 2)
plt.title("Quiet")
plt.plot(newdata)

plt.show()

The output:

Content-Type: audio/wav
Last-Modified: Tue, 12 Aug 2014 15:53:38 GMT
Accept-Ranges: bytes
ETag: "ac20ba9445b6cf1:0"
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Tue, 26 Sep 2017 22:32:35 GMT
Connection: close
Content-Length: 43642


Data type uint8 Shape (43584,)
Data type uint8 Shape (43584,)

[Figure: plots of the original waveform and the quieter waveform]
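
Broadcasting is not limited to scalars. A minimal sketch with small made-up arrays (samples, column, and row are hypothetical names), showing the scalar case and a case where two arrays of different shapes are combined:

```python
import numpy as np

samples = np.array([100, 150, 200], dtype=np.uint8)

# Scalar and array: the scalar is applied to every element.
quieter = samples * 0.2
print(quieter)

# Shapes (3, 1) and (3,) are broadcast together to shape (3, 3):
# every element of the column is added to every element of the row.
column = np.array([[0], [10], [20]])
row = np.array([1, 2, 3])
table = column + row
print(table)
```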

Indexing NumPy arrays with Booleans

Boolean indexing is a kind of fancy indexing, so it works in much the same way: instead of a list of positions, we pass an array of Booleans that marks the elements to select.

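import scipy.misc
import matplotlib.pyplot as plt
import numpy as np

# Note: in SciPy 1.10 and later the same image is available as scipy.datasets.face()
face = scipy.misc.face()
xmax = face.shape[0]
ymax = face.shape[1]
# Crop the image to a square
face = face[:min(xmax,ymax),:min(xmax,ymax)]

def get_indices(size):
    # Boolean array that is True at every fourth position
    arr = np.arange(size)
    return arr % 4 == 0

# Set every fourth point on the diagonal to 0
face1 = face.copy()
xindices = get_indices(face.shape[0])
yindices = get_indices(face.shape[1])
face1[xindices, yindices] = 0
plt.subplot(211)
plt.imshow(face1)

# Set values between a quarter and three quarters of the maximum to 0
face2 = face.copy()
face2[(face > face.max()/4) & (face < 3 * face.max()/4)] = 0
plt.subplot(212)
plt.imshow(face2)
plt.show()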

#out
[Figure: the face photo with every fourth diagonal point set to 0 (top) and with mid-range values set to 0 (bottom)]
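
The same idea on a small made-up one-dimensional array (arr and mask are illustrative names), which makes the effect of the mask easier to inspect than on an image:

```python
import numpy as np

arr = np.arange(10)

# A comparison produces a Boolean array with the same shape as arr.
mask = arr % 4 == 0

# Indexing with the mask selects the elements where it is True.
selected = arr[mask]
print(selected)

# Assigning through the mask changes only those elements.
arr[mask] = -1
print(arr)

# Conditions are combined with & and |, each wrapped in parentheses.
middle = arr[(arr > 2) & (arr < 7)]
print(middle)
```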

Indexing with a list of locations

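We are again going to use the face photo from scipy.misc.face() and apply the ix_() function to shuffle it. The ix_() function takes lists of row and column locations and builds index arrays that select every combination of them.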

import scipy.misc
import matplotlib.pyplot as plt
import numpy as np

face = scipy.misc.face() 
xmax = face.shape[0]
ymax = face.shape[1]

def shuffle_indices(size):
    arr = np.arange(size)
    np.random.shuffle(arr)

    return arr

xindices = shuffle_indices(xmax)
np.testing.assert_equal(len(xindices), xmax)
yindices = shuffle_indices(ymax)
np.testing.assert_equal(len(yindices), ymax)
plt.imshow(face[np.ix_(xindices, yindices)])
plt.show()

#out
[Figure: the face photo with its rows and columns shuffled]
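
A minimal sketch of what ix_() does, on a small made-up array (grid, rows, and cols are illustrative names), contrasted with passing the lists directly:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
rows = [2, 0]
cols = [3, 1, 0]

# np.ix_ builds index arrays that select every row/column combination,
# giving a 2 x 3 result with rows and columns in the requested order.
block = grid[np.ix_(rows, cols)]
print(block)

# Passing two lists directly pairs them element by element instead,
# so this picks only grid[2, 3] and grid[0, 1].
pairs = grid[[2, 0], [3, 1]]
print(pairs)
```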

Fancy indexing

Fancy indexing is indexing that does not rely on plain integers and slices; instead, we index with lists or arrays of locations. If we take the face photo again, we can draw two black diagonal lines on it by setting the diagonal values to 0.

import scipy.misc
import matplotlib.pyplot as plt

face = scipy.misc.face()

xmax = face.shape[0]
ymax = face.shape[1]

# Crop to a square and copy, so the assignments below work on a writable array
face = face[:min(xmax,ymax),:min(xmax,ymax)].copy()

xmax = face.shape[0]
ymax = face.shape[1]

# Set both diagonals to 0 (black)
face[range(xmax),range(ymax)] = 0
face[range(xmax-1,-1,-1), range(ymax)] = 0

plt.imshow(face)
plt.show()

#out
[Figure: the face photo with two black diagonal lines]
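
The diagonal trick is easier to check on a small array. A minimal sketch, assuming a made-up 4 x 4 array of ones (square and n are illustrative names):

```python
import numpy as np

n = 4
square = np.ones((n, n), dtype=int)

# Paired index arrays address (0, 0), (1, 1), ...: the main diagonal.
square[np.arange(n), np.arange(n)] = 0

# Reversing the row indices addresses (3, 0), (2, 1), ...: the anti-diagonal.
square[np.arange(n - 1, -1, -1), np.arange(n)] = 0

print(square)
```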

Creating array views and copies

You might think that a view only lets you look at the data and nothing more, as in the ravel() example on an array. In NumPy, however, views are NOT read-only!

That is why it is important to make a copy when you need data that is independent of the shared array. A slice of an array produces a view; if you assign the slice to a variable and then alter the underlying array, the value of the variable changes as well.

Example:
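
A minimal sketch of the view-versus-copy behaviour described above, using a small made-up array rather than the post's own data (original, view, and independent are illustrative names):

```python
import numpy as np

original = np.arange(6)

view = original[:3]                 # a slice is a view on the same data
independent = original[:3].copy()   # copy() creates independent data

# Changing the underlying array is visible through the view only.
original[0] = 99
print(view[0], independent[0])

# Views are writable: changing the view changes the underlying array.
view[1] = -5
print(original[1])

print(np.shares_memory(original, view))
print(np.shares_memory(original, independent))
```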
