반응형

pip install pandas

 

현재 폴더 경로 

각 폴더에는 아래와 같이 전처리 되지 않은 CSV가 존재한다.

JS01

 

JS02

 

JS03

[JS01-210210_222834_p9_고정, JS02-210210_222841_p9고정, JS03-210210_222859_p9_고정]과 같이 데이터 프레임을 병합하고, 분석에 알맞게 전처리를 하는 과정을 보여준다.  

 

아래는 실험 데이터로 중간에 빈 데이터가 보이고, 전처리 과정이 필요한 원본 데이터이다.  

 

raw data
preprocessing data


이제부터 Pandas를 통해 데이터 분석에 알맞게 전처리를 할 것이다.  (위 preprocessing data가 최종 목표 데이터)

 

 

 

* Section 1 *

 

0. 모든 CSV를 불러온다. 

1. Null/NaN(결손) 데이터를 제거한다. 

   ex) (df : dataframe) df.isna.sum() 를 통해서 결손데이터를 파악할 수 있다. ( df.info()도 확인가능하다. )

2. 필요없는 열(bootms)을 삭제한다. 

3. Column 이름을 변경한다. (" aCnt" ▶"aCnt", " RSSI" "RSSI") 

   ( " aCnt"와 "aCnt"의 차이는 띄어쓰기이다. df.columns 를 통해 정확한 column name을 작성한다. 띄어쓰기 구별 )

4. 1001 이상 aCnt 행을 삭제한다. 

5. aCnt를 Group 내어 평균한다. 

6. CSV 데이터로 저장한다. 

 

import pandas as pd
import os


# 모든 CSV 경로를 변수에 저장한다..
list_folder_name = ['JS01', 'JS02', 'JS03']
list_csv_folder = [os.path.join(os.getcwd(), file_name) for file_name in list_folder_name]
list_p = []
list_path = []

for i in list_csv_folder:
    _, _, file_names = next(os.walk(i))
    for n, j  in enumerate(file_names):
        list_path.append(os.path.join(i,j))
        print(list_path[n])
        
# result 
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_222834_p9_고정.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_224020.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_225507.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_225821.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230150.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230314.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230540.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230845.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_222834_p9_고정.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_224020.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_225507.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_225821.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230150.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230314.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230540.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230845.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_222834_p9_고정.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_224020.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_225507.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_225821.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230150.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230314.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230540.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230845.csv

 

# 파일 별로 1번 ~ 6번 과정을 반복한다.
# 0. 모든 CSV를 불러온다. 
for i in range(len(list_path)):
    print(list_path[i])
    # 0_1. CSV를 불러온다.
    csv_data = pd.read_csv(list_path[i])
    
    # 1. Null/NaN(결손) 데이터를 제거한다. 
    csv_data.dropna(inplace=True)
    
    # 2. 필요없는 열(bootms)을 삭제한다. 
    csv_data.rename(columns={" aCnt":"aCnt", " RSSI":"RSSI"}, inplace=True)
    
    # 3. Column 이름을 변경한다. (" aCnt" ▶"aCnt", " RSSI" ▶ "RSSI") 
    csv_data.drop(columns=['bootms'], inplace=True)
    
    # 4. 1001 이상 aCnt 행을 삭제한다.
    csv_data.query('aCnt<=1000', inplace=True)

    # 5. aCnt를 Group 내어 평균한다. 
    csv_data_group = csv_data.groupby(['aCnt']).mean().round(1)
    
    # 6. CSV 데이터로 저장한다. 
    csv_data_group.to_csv(list_path[i], mode='w', index=True)

head(5)
tail(5)

1차적으로 필요없는 열 제거, 결손 데이터 제거 등 필수 작업은 끝이 났다. 

 

이제 분석을 하기 위해 Dataframe 병합, 분석에 필요한 열 추가를 할 것이다. 

 

 

 

* Section 2 *

 

import pandas as pd
import numpy as np
import os


list_folder_name = ['JS01', 'JS02', 'JS03']
list_csv_folder = [os.path.join(os.getcwd(), file_name) for file_name in list_folder_name]
list_path = []

for i in list_csv_folder:
    _, _, file_names = next(os.walk(i))
    for j in file_names:
        list_path.append(os.path.join(i,j))
        
# result 
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_222834_p9_고정.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_224020.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_225507.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_225821.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230150.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230314.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230540.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230845.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_222834_p9_고정.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_224020.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_225507.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_225821.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230150.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230314.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230540.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230845.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_222834_p9_고정.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_224020.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_225507.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_225821.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230150.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230314.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230540.csv
c:\Users\mangnani\Desktop\Develop\VS Code\Python37\[티스토리]__Pandas전처리\JS01\210210_230845.csv        

 

np_path = np.array(list_path) # (24)
np_path = np_path.reshape(len(list_folder_name), -1) # (3, 8)
np_path_trans = np_path.T  # Transpose (8, 3)  
print(np_path_trans)

# result 
"""
[['c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS01\\210210_222834_p9_고정.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS02\\210210_222841_p9고정.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS03\\210210_222859_p9_고정.csv']
 ['c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS01\\210210_224020.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS02\\210210_224045.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS03\\210210_224041.csv']
 ['c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS01\\210210_225507.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS02\\210210_225511.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS03\\210210_225519.csv']
 ['c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS01\\210210_225821.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS02\\210210_225828.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS03\\210210_225837.csv']
 ['c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS01\\210210_230150.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS02\\210210_230159.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS03\\210210_230155.csv']
 ['c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS01\\210210_230314.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS02\\210210_230325.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS03\\210210_230329.csv']
 ['c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS01\\210210_230540.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS02\\210210_230607.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS03\\210210_230601.csv']
 ['c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS01\\210210_230845.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS02\\210210_230850.csv'
  'c:\\Users\\mangnani\\Desktop\\Develop\\VS Code\\Python37\\[티스토리]__Pandas전처리\\JS03\\210210_230903.csv']]
  """

 

df_1 = pd.read_csv(np_path_trans[file_index][0])
df_2 = pd.read_csv(np_path_trans[file_index][1])
df_3 = pd.read_csv(np_path_trans[file_index][2])

print(df_1, df_2, df_3, sep="\n")

# result
       aCnt  RSSI
0       0.0 -73.2
1       1.0 -71.5
2       2.0 -72.5
3       3.0 -72.0
4       4.0 -73.0
..      ...   ...
988   996.0 -73.3
989   997.0 -73.3
990   998.0 -74.0
991   999.0 -72.7
992  1000.0 -73.0

[993 rows x 2 columns]
       aCnt  RSSI
0       0.0 -85.0
1       1.0 -84.0
2       2.0 -85.0
3       4.0 -81.0
4       5.0 -82.0
..      ...   ...
741   996.0 -84.0
742   997.0 -83.0
743   998.0 -81.5
744   999.0 -80.0
745  1000.0 -81.5

[746 rows x 2 columns]
      aCnt  RSSI
0      0.0 -80.5
1      1.0 -82.0
2      2.0 -81.0
3      3.0 -85.0
4      4.0 -81.0
..     ...   ...
759  982.0 -83.0
760  984.0 -83.0
761  990.0 -87.0
762  992.0 -86.0
763  999.0 -85.0

[764 rows x 2 columns]

데이터 프레임을 자세히 보면 세 개의 df 행의 갯수가 다름을 알 수 있다. ( 이는 데이터 특성상 aCnt가 없는 것을 의미한다.) 

 

각각 데이터 프레임 aCnt의 교집합을 찾아 병합할 것이다. 

df_tmp = pd.merge(df_1, df_2, on='aCnt', how='inner')
df_rssi = pd.merge(df_tmp, df_3, on='aCnt', how='inner')

df_rssi

# result
aCnt	RSSI_x	RSSI_y	RSSI
0	0.0	-73.2	-85.0	-80.5
1	1.0	-71.5	-84.0	-82.0
2	2.0	-72.5	-85.0	-81.0
3	4.0	-73.0	-81.0	-81.0
4	5.0	-73.0	-82.0	-83.5
...	...	...	...	...
586	982.0	-73.3	-83.3	-83.0
587	984.0	-73.3	-81.0	-83.0
588	990.0	-73.0	-83.0	-87.0
589	992.0	-73.7	-82.5	-86.0
590	999.0	-72.7	-80.0	-85.0
591 rows × 4 columns

 

Colum name을 알맞게 변경한다. 

 

df_rssi.rename(columns={"RSSI_x":"RSSI_1", "RSSI_y":"RSSI_2", "RSSI":"RSSI_3"}, inplace=True)

df_rssi

# result

aCnt	RSSI_1	RSSI_2	RSSI_3
0	0.0	-73.2	-85.0	-80.5
1	1.0	-71.5	-84.0	-82.0
2	2.0	-72.5	-85.0	-81.0
3	4.0	-73.0	-81.0	-81.0
4	5.0	-73.0	-82.0	-83.5
...	...	...	...	...
586	982.0	-73.3	-83.3	-83.0
587	984.0	-73.3	-81.0	-83.0
588	990.0	-73.0	-83.0	-87.0
589	992.0	-73.7	-82.5	-86.0
590	999.0	-72.7	-80.0	-85.0
591 rows × 4 columns

RSSI를 통한 Distance를 추가할 것이다. ( lambda 함수 사용 )

def lambda_distance(rssi, tp=-64):
    ratio = rssi*1.0/tp
    if ratio < 1.0:
        return pow(ratio, 10)
    else:
        distance = (0.89976)*pow(ratio,7.7095) + 0.111
        return distance

 

df_rssi['distance_1'] = df_rssi['RSSI_1'].apply(lambda rssi:lambda_distance(rssi))
df_rssi['distance_2'] = df_rssi['RSSI_2'].apply(lambda rssi:lambda_distance(rssi))
df_rssi['distance_3'] = df_rssi['RSSI_3'].apply(lambda rssi:lambda_distance(rssi))

df_rssi

# result
	aCnt	RSSI_1	RSSI_2	RSSI_3	distance_1	distance_2	distance_3
0	0.0	-73.2	-85.0	-80.5	2.645139	8.132164	5.384701
1	1.0	-71.5	-84.0	-82.0	2.225242	7.432724	6.191371
2	2.0	-72.5	-85.0	-81.0	2.464196	8.132164	5.642557
3	4.0	-73.0	-81.0	-81.0	2.592246	5.642557	5.642557
4	5.0	-73.0	-82.0	-83.5	2.592246	6.191371	7.103364
...	...	...	...	...	...	...	...
586	982.0	-73.3	-83.3	-83.0	2.671951	6.975277	6.786975
587	984.0	-73.3	-81.0	-83.0	2.671951	5.642557	6.786975
588	990.0	-73.0	-83.0	-87.0	2.592246	6.786975	9.707345
589	992.0	-73.7	-82.5	-86.0	2.781686	6.483119	8.889049
590	999.0	-72.7	-80.0	-85.0	2.514708	5.137370	8.132164
591 rows × 7 columns

df_rssi.to_csv(csv_name, mode='w', index=False)

 

728x90
반응형

+ Recent posts