User:Kithira/Course Pages/CSCI 12/Assignment 2/Group 4/Homework 4
Method of Filtering
To identify spikes in the data, we compute the standard deviation of the data points in one-minute windows. We then flag outliers among these per-minute standard deviations (again using a standard-deviation threshold) and mark those minutes as places to look for spikes. Within the flagged minutes, we take the standard deviation of all the one-second measurements; this gives us a criterion for which points are outliers, which we then filter out. This is step one. We repeat the process with the windows shifted by thirty seconds: after checking that the first thirty measurements were indeed not marked as possible outliers in step one, we temporarily remove them from the data set and run step one again. This yields two lists of possible outliers, which we combine and analyze to determine the spikes in the data.
Filtering Code
The code is explained in the comments.
from numpy import std

infile = open('data.txt', 'r')
minfile = open('min.txt', 'w')
# When we run again, we change spikes.txt to spikes2.txt
spikes = open('spikes.txt', 'w')
lines = infile.readlines()
# A temporary list of accelerations per second in a minute time period
accelist = []
stdev = 0
# A list of all the standard deviations per minute time period
devlist = []
# The corresponding time stamps to the devlist
timelist = []
"""
How to eliminate the first thirty data points on the second pass:
for i in range(30):
    infile.readline().strip()
"""
for line in lines:
    linelist = line.split()
    date = linelist[0]
    time = linelist[1]
    activity = linelist[2]
    accelist.append(float(activity))
    # After a minute of seconds has been collected
    if len(accelist) == 60:
        # Average the acceleration marks
        total = 0
        for i in accelist:
            total = total + float(i)
        avg = total / 60.0
        stdev = std(accelist)
        devlist.append(stdev)
        tempList = [linelist[0], linelist[1], str(avg)]
        timelist.append(tempList)
        # Writes preliminary data file with unfiltered time stamps.
        minfile.write(date + " " + time + " " + str(avg) + " " + str(stdev))
        minfile.write("\n")
        accelist = []
infile.close()
minfile.close()
# A list of data points that have abnormally high standard deviations
lookAt = []
# Twice the standard deviation of the per-minute deviations for the entire data set
S = 2 * std(devlist)
for i in range(len(devlist)):
    # Higher levels of activity have larger differences in acceleration, thus the < 0.3
    if devlist[i] > S and float(timelist[i][2]) < 0.3:
        print(timelist[i][2])
        lookAt.append(timelist[i])

def uniqueTime(date, time):
    '''Creates a unique timestamp for the particular minute.'''
    d = date.split("-")[2]
    h = time[:2]
    m = time.split(":")[1]
    s = time.split(":")[2][:2]
    unique = d + h + m + s
    return int(unique)

# A list that associates a unique timestamp with the data that needs to be analyzed
uniqueLookAt = []
for elem in lookAt:
    x = uniqueTime(elem[0], elem[1])
    uniqueLookAt.append(x)
# A list of second timestamps that need to be analyzed
errorlist = []
# A list of corresponding accelerations that need to be analyzed
accerrorlist = []
# IDs the seconds that need to be analyzed
for i in range(len(lines)):
    linelist = lines[i].split()
    date = linelist[0]
    time = linelist[1]
    x = uniqueTime(date, time)
    for a in uniqueLookAt:
        if 0 < x - a < 100:
            accerrorlist.append(float(linelist[2]))
            errorlist.append(linelist)
AccStd = std(accerrorlist)
lookAt = []
# Checks the data points and IDs them as outliers and possible spikes
for i in range(len(errorlist)):
    if accerrorlist[i] > (2 * AccStd):
        lookAt.append(errorlist[i])
        spike = errorlist[i][0] + " " + errorlist[i][1] + " " + errorlist[i][2]
        spikes.write(spike)
        spikes.write("\n")
spikes.close()
Augmentation Code
Code used to combine spikes.txt from the first pass with spikes2.txt from the second pass: a spike is confirmed only if it was identified in both.
spike1 = open('spikes.txt', 'r')
spike2 = open('spikes2.txt', 'r')
data = open('data.txt', 'r')
final = open('finalspikes.txt', 'w')
filtered = open('filteredData.txt', 'w')
lines1 = spike1.readlines()
lines2 = spike2.readlines()
# Keep only the spikes that appear in both passes
for line1 in lines1:
    for line2 in lines2:
        if line1 == line2:
            final.write(line1)
final.close()
final = open('finalspikes.txt', 'r')
spikelines = final.readlines()
datalines = data.readlines()
# Copy the raw data, prefixing confirmed spikes with the word "spike"
for line in datalines:
    if line in spikelines:
        filtered.write("spike ")
    filtered.write(line)
spike1.close()
spike2.close()
data.close()
final.close()
filtered.close()
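The nested loop above compares every pair of lines, which is quadratic in the number of spikes. The same intersection can be computed with a set in linear time; the sketch below shows the idea on made-up spike lines (the file handling is omitted and the sample data is an assumption, not output from our runs).

```python
# Alternative sketch: the intersection of the two spike lists computed
# with a set instead of a nested loop. The spike lines here are made up.
lines1 = ["04 12:00:01 0.9\n", "04 12:00:05 1.1\n"]  # first pass
lines2 = ["04 12:00:05 1.1\n", "04 12:30:10 0.8\n"]  # second pass
second_pass = set(lines2)
# A spike is confirmed only if both passes reported it
common = [line for line in lines1 if line in second_pass]
print(common)  # only the line found in both runs survives
```

For spike files of our size the nested loop is fast enough, but the set version scales better if the sampling period grows.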