[Practice] Advanced Exercises

1) Using a Box Plot

<aside>

Draw a Box Plot of the total_spent column for each membership level (membership_level) and analyze the outliers. Extract the rows that contain outliers and save them as outliers.csv.

</aside>

import pandas as pd
import seaborn as sns

# Load the data
data = pd.read_csv('user_purchase_data.csv')

# Box Plot of total_spent by membership level
total_spent = data['total_spent']
membership_level = data['membership_level']
sns.boxplot(x=membership_level, y=total_spent)

# Extract outliers with the IQR rule
# Note: these fences use the whole column, so they can differ from the per-level whiskers in the plot
q1 = total_spent.quantile(0.25)
q3 = total_spent.quantile(0.75)
iqr = q3 - q1

boundary = 1.5 * iqr

outlier = data[(total_spent > q3 + boundary) | (total_spent < q1 - boundary)]

# Save the outlier rows to CSV (the task asks for outliers.csv)
outlier.to_csv("outliers.csv")
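
The fences above come from the whole total_spent column, while a Box Plot split by membership_level draws its whiskers separately for each level. A minimal sketch of per-level extraction follows, assuming the same 1.5 × IQR rule. The group_outliers helper and the sample values are hypothetical and are not taken from user_purchase_data.csv.

```python
import pandas as pd

# Hypothetical sample standing in for user_purchase_data.csv
sample = pd.DataFrame({
    "membership_level": ["Gold"] * 5 + ["Silver"] * 5,
    "total_spent": [100, 110, 120, 130, 500, 10, 12, 14, 16, 60],
})


def group_outliers(df, group_col, value_col, k=1.5):
    """Return rows whose value lies outside the IQR fences of their own group."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    mask = (df[value_col] < q1 - k * iqr) | (df[value_col] > q3 + k * iqr)
    return df[mask]


per_group = group_outliers(sample, "membership_level", "total_spent")
print(per_group)
```

In this sample, 60 is flagged only by the per-level fences: it is unusual for Silver but sits well inside the fences computed over all rows.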

2) Using a Scatter Plot

<aside>

Draw a Scatter Plot using the ad_spend and total_spent columns and analyze the relationship between the two variables. Analyze how advertising spend affects the total amount spent.

</aside>

import matplotlib.pyplot as plt

# Load the data
data = pd.read_csv("user_purchase_data.csv")

# Assign the columns to variables
ad_spend = data['ad_spend']
total_spent = data['total_spent']

# Scatter Plot
plt.scatter(ad_spend, total_spent)

# Print the correlation coefficient: 0.018376, so almost no linear correlation
data[['ad_spend', 'total_spent']].corr(numeric_only=True)
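
The correlation coefficient measures how strong a linear relationship is, but not how much total_spent changes per unit of ad_spend. One way to put a number on that effect is the slope of a least-squares line. The sketch below is only an illustration: linear_summary is a hypothetical helper, and the toy values are made up rather than taken from the dataset.

```python
import numpy as np
import pandas as pd


def linear_summary(df, x_col, y_col):
    """Return Pearson r plus the slope and intercept of a least-squares line."""
    pair = df[[x_col, y_col]].dropna()
    x = pair[x_col].to_numpy(dtype=float)
    y = pair[y_col].to_numpy(dtype=float)
    # Degree-1 polyfit returns the coefficients as [slope, intercept]
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    return {"r": r, "slope": slope, "intercept": intercept}


# Hypothetical toy data, not the real user_purchase_data.csv
toy = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "total_spent": [105, 98, 110, 101, 104],
})
summary = linear_summary(toy, "ad_spend", "total_spent")
print(summary)
```

A slope close to zero together with an r close to zero would support the reading that ad spend has little linear effect on total spend. A nonlinear relationship could still exist, so the Scatter Plot remains worth inspecting.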

3) Correlation Analysis

<aside>

Compute the correlations between all numeric columns and analyze which variables are highly correlated.

</aside>

# Load the data
data = pd.read_csv('./user_purchase_data.csv')

# Compute the correlation coefficients as a table
corr_data = data.corr(numeric_only=True)

# Draw the heatmap
sns.heatmap(corr_data, annot=True, cmap='crest')

# Strong correlations (|r| > 0.7); all other cells become NaN, and the diagonal of 1s is kept
high_corr = corr_data[corr_data.abs() > 0.7]
print(high_corr)

Result image
Reference: DataFrame.abs()
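
Masking the full matrix keeps the diagonal (every column correlates perfectly with itself) and lists each pair twice. A possible way to get one row per strongly correlated pair is sketched below. The high_corr_pairs helper and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.7):
    """List each column pair once, keeping only |r| above the threshold."""
    corr = df.corr(numeric_only=True)
    # Strict upper triangle: drops the diagonal and the mirrored lower half
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    strong = pairs[pairs.abs() > threshold]
    # Strongest relationships first, regardless of sign
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)


# Hypothetical toy data reusing column names from the exercise
toy = pd.DataFrame({
    "price": [1, 2, 3, 4, 5],
    "total_spent": [2, 4, 6, 8, 11],
    "ad_spend": [5, 1, 4, 2, 3],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

The result is a Series indexed by (column, column) pairs, which is easier to read than a mostly NaN matrix when there are many columns.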

4) Identifying Outliers

<aside>

Identify the outliers in the price and total_spent columns. Use the IQR method.

</aside>

# Load the data
data = pd.read_csv('user_purchase_data.csv')

# Outlier detection function (IQR method)
def outlier(data, column):
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1

    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)

    return data[(lower_bound > column) | (upper_bound < column)]

# Identify the outliers in price and total_spent
price_outlier = outlier(data, data['price'])
total_spent_outlier = outlier(data, data['total_spent'])
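
To see what the IQR rule flags, it can help to print the fences and the number of flagged rows for each column. The following is a small sketch under the same 1.5 × IQR assumption. The iqr_bounds helper and the toy values are hypothetical.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical toy data, not the real user_purchase_data.csv
toy = pd.DataFrame({
    "price": [10, 12, 11, 13, 90],
    "total_spent": [100, 105, 98, 102, 101],
})

report = {}
for col in ["price", "total_spent"]:
    lower, upper = iqr_bounds(toy[col])
    # Values exactly on a fence are not counted as outliers
    flagged = toy[(toy[col] < lower) | (toy[col] > upper)]
    report[col] = {"lower": lower, "upper": upper, "count": len(flagged)}

print(pd.DataFrame(report).T)
```

With these toy values only the price of 90 falls outside its fences. The total_spent value of 105 sits exactly on the upper fence and is therefore kept.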