
Connecting to Kaggle

!pip install kaggle
!pip install --upgrade --force-reinstall --no-deps kaggle
from google.colab import files
files.upload()
Successfully built kaggle
Successfully installed kaggle-1.5.12
Saving kaggle.json to kaggle.json
{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c kaggle-survey-2021
Downloading kaggle-survey-2021.zip to /content
  0% 0.00/3.01M [00:00<?, ?B/s]
100% 3.01M/3.01M [00:00<00:00, 103MB/s]
!unzip kaggle-survey-2021.zip
Archive:  kaggle-survey-2021.zip
  inflating: kaggle_survey_2021_responses.csv  
  inflating: supplementary_data/kaggle_survey_2021_answer_choices.pdf  
  inflating: supplementary_data/kaggle_survey_2021_methodology.pdf  

Loading the Data

import gc # For Memory Optimization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # Not sure if I used this
from wordcloud import WordCloud
from scipy.stats import norm

# Some more necessary libraries (These are for drawing the image on the bar charts)
import matplotlib.font_manager as fm
from matplotlib.offsetbox import TextArea, DrawingArea, OffsetImage, AnnotationBbox
import matplotlib.image as mpimg

# To Avoid unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

# Since there are many columns, I would like to view them all
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 400)
df = pd.read_csv('kaggle_survey_2021_responses.csv')
df = df.iloc[1:,:] # The first row was describing the columns. Better to look at the description from the Metadata file provided
df.head(3).style.set_properties(**{"background-color": "#76c5d6","color": "black", "border-color": "black"})
(Output: df.head(3) — 369 columns, from "Time from Start to Finish (seconds)" through Q1–Q42 with their _Part_N / _OTHER sub-columns; the very wide table is omitted here.)
print('Number of rows:', df.shape[0])
print('Number of columns:', df.shape[1])
Number of rows: 25973
Number of columns: 369

Question 1: Age

df['Q1'].value_counts()
25-29    4931
18-21    4901
22-24    4694
30-34    3441
35-39    2504
40-44    1890
45-49    1375
50-54     964
55-59     592
60-69     553
70+       128
Name: Q1, dtype: int64
fig, ax = plt.subplots(figsize=(25,10), facecolor="w")

# Method for image
def make_img(img,zoom, x, y):
    img = mpimg.imread(img)
    imagebox = OffsetImage(img, zoom=zoom)
    ab = AnnotationBbox(imagebox, (x,y),frameon=False)
    ax.add_artist(ab)

img_file = "https://www.freeiconspng.com/thumbs/crown-icon/queen-crown-icon-4.png"
zoom = 1
img_y= 4.8

# Creating a DataFrame to get the values and their counts (this was for my purpose)
# new_df = pd.DataFrame(df['Q1'].value_counts())

# I wanted to have the highest value in the middle, so I wrote the following two code lines
age_bucket = ['70+','55-59','45-49','35-39','22-24','25-29','18-21','30-34','40-44','50-54','60-69']   #new_df.index
age_bucket_cnt = [128,592,1375,2504,4694,4931,4901,3441,1890,964,553]   #list(new_df.Q1.values)

color = ['#E6E6E6', '#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6'] # Deciding the color
width = [0.8, 0.8, 0.8, 0.8, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.8] # The Width
alpha = [0.3, 0.45, 0.5, 0.6, 0.75, 1.0, 0.75, 0.6, 0.5, 0.45, 0.3] # The Opacity

fontsize= [20, 20, 20, 20, 25, 35, 30, 20, 20, 20, 20]
x_num = [0,1,2,3,4,5,6,7,8,9,10]

for i in range(11):
    plt.bar(x=age_bucket[i], height=age_bucket_cnt[i], width=width[i], color=color[i], alpha=alpha[i])
    plt.text(s=age_bucket[i], x=x_num[i], y=age_bucket_cnt[i], va='bottom', ha='center', fontsize=fontsize[i], alpha=alpha[i])

# Title, drawn once outside the loop
plt.text(s="Age Bucket of all Kagglers", x=5, y=5500, fontsize=50, va='bottom', ha='center', color='#189AB4')

# Placing the image
make_img(img_file,0.2, 5, 4700)    
    
gc.collect() # For Memory Optimization

plt.axis('off')
plt.show()

It definitely feels like the platform is heavily used by university students and job seekers.

Still, I found it interesting that usage in the 18-21 age group is higher than I expected.
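The hand-copied lists above (`age_bucket`, `age_bucket_cnt`) could also be derived from `value_counts()` directly. A minimal sketch of the "highest value in the middle" ordering, using a small stand-in Series in place of `df['Q1'].value_counts()`:

```python
import pandas as pd

def center_peak_order(counts: pd.Series):
    """Return (labels, values) with the largest count in the middle,
    alternating smaller values outward so the bars form a pyramid."""
    left, right = [], []
    for i, (label, value) in enumerate(counts.sort_values().items()):
        (left if i % 2 == 0 else right).append((label, value))
    ordered = left + right[::-1]
    return [l for l, _ in ordered], [v for _, v in ordered]

# Stand-in for df['Q1'].value_counts()
counts = pd.Series({'18-21': 4901, '22-24': 4694, '25-29': 4931, '30-34': 3441})
labels, values = center_peak_order(counts)
```

This avoids keeping the label and count lists in sync by hand when the data changes.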

Question 2: Gender

df['Q2'].value_counts()
Man                        20598
Woman                       4890
Prefer not to say            355
Nonbinary                     88
Prefer to self-describe       42
Name: Q2, dtype: int64
Gender = ['Man', 'Woman', 'Others']

# Slice sizes from the counts above ('Others' = 355 + 88 + 42 = 485)
Gender_cnt = [20598, 4890, 485]

# colors (one per slice)
colors = ['#E6E6E6', '#189AB4', '#FFFF00']

# explosion
explode = (0.05, 0.05, 0.2)
  
    
plt.figure(figsize=[20,10])    
# Pie Chart
plt.pie(Gender_cnt, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2,
        explode=explode,)
  
# draw circle
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()

plt.legend(Gender, loc = "upper right",title="Genders", prop={'size': 15})
     
# Adding Circle in Pie chart
fig.gca().add_artist(centre_circle)
  
plt.rcParams['font.size'] = 25    
# Adding Title of chart
plt.text(s="Gender Diversity in Kaggle",x=0,y=1.3, fontsize=50,va='bottom',ha='center',color='#189AB4')
  
gc.collect()
# Displaying chart
plt.show()

About 80% are men, about 18% are women, and the remaining ~2% gave other responses (prefer not to say, not stated, etc.).

This certainly seems to be a male-dominated field.
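The hard-coded 485 above is the sum of the three minor categories. A sketch of deriving it from the counts instead, with a stand-in Series for `df['Q2'].value_counts()`:

```python
import pandas as pd

# Stand-in for df['Q2'].value_counts()
counts = pd.Series({'Man': 20598, 'Woman': 4890,
                    'Prefer not to say': 355, 'Nonbinary': 88,
                    'Prefer to self-describe': 42})

top = counts.nlargest(2)               # Man, Woman
others = counts.drop(top.index).sum()  # 355 + 88 + 42 = 485
Gender = list(top.index) + ['Others']
Gender_cnt = list(top.values) + [others]
```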

Question 3: Country

df['Q3'].value_counts()
India                                                   7434
United States of America                                2650
Other                                                   1270
Japan                                                    921
China                                                    814
Brazil                                                   751
Russia                                                   742
Nigeria                                                  702
United Kingdom of Great Britain and Northern Ireland     550
Pakistan                                                 530
Egypt                                                    482
Germany                                                  470
Spain                                                    454
Indonesia                                                444
Turkey                                                   416
France                                                   401
South Korea                                              359
Taiwan                                                   334
Canada                                                   331
Bangladesh                                               317
Italy                                                    311
Mexico                                                   279
Viet Nam                                                 277
Australia                                                264
Kenya                                                    248
Colombia                                                 225
Poland                                                   219
Iran, Islamic Republic of...                             195
Ukraine                                                  186
Singapore                                                182
Argentina                                                182
Malaysia                                                 156
Netherlands                                              153
South Africa                                             146
Morocco                                                  140
Israel                                                   138
Thailand                                                 123
Portugal                                                 119
Peru                                                     117
United Arab Emirates                                     111
Tunisia                                                  109
Philippines                                              108
Sri Lanka                                                106
Chile                                                    102
Greece                                                   102
Ghana                                                     99
Saudi Arabia                                              89
Ireland                                                   84
Sweden                                                    81
Hong Kong (S.A.R.)                                        79
Nepal                                                     75
Switzerland                                               71
I do not wish to disclose my location                     69
Belgium                                                   65
Czech Republic                                            63
Romania                                                   61
Austria                                                   51
Belarus                                                   51
Ecuador                                                   50
Denmark                                                   48
Uganda                                                    47
Norway                                                    45
Kazakhstan                                                45
Algeria                                                   44
Ethiopia                                                  43
Iraq                                                      43
Name: Q3, dtype: int64
!pip install geopandas
import geopandas as gpd

# List of countries we are interested in
lis_countries = ["Algeria","Argentina","Australia","Austria","Bangladesh","Belarus","Belgium","Brazil","Canada","Chile","China","Colombia",
                 "Czechia","Denmark","Ecuador","Egypt","Ethiopia","France","Germany","Ghana","Greece","India","Indonesia","Iraq","Ireland",
                 "Israel","Italy","Japan","Kazakhstan","Kenya","Malaysia","Mexico","Morocco","Nepal","Netherlands","Nigeria","Norway","Pakistan",
                 "Peru","Philippines","Poland","Portugal","Romania","Russia","Saudi Arabia","South Africa","South Korea","Spain","Sri Lanka",
                 "Sweden","Switzerland","Taiwan","Thailand","Tunisia","Turkey","Uganda","Ukraine","United Arab Emirates","United Kingdom",
                 "United States of America","Vietnam"]

# Reading the geopandas data 
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

country_data = lis_countries # Passing the list of countries here
country_geo = list(world['name']) # The country list from the geopandas dataset

# List of all the values of population of Kagglers from each country
lis_pop = [44,182,264,51,317,51,65,751,331,102,814,225,63,48,50,482,43,401,470,99,102,7434,444,43,84,138,311,921,45,248,156,279,140,75,153,
           702,45,530,117,108,219,119,61,742,89,146,359,454,106,81,71,334,123,109,416,47,186,111,550,2650,277]

# Next we need to create a dataframe with lis_countries and lis_pop
our_country_analysis = pd.DataFrame(lis_countries, columns=['Country'])
our_country_analysis['KagglePopulation'] = lis_pop

# Next, we are going to visualize this...
mapped = world.set_index('name').join(our_country_analysis.set_index('Country')).reset_index()

to_be_mapped = 'KagglePopulation'
vmin, vmax = 0,10000
fig, ax = plt.subplots(1, figsize=(25,30))

mapped.dropna().plot(column=to_be_mapped, cmap='cividis', linewidth=0.8, ax=ax, edgecolors='1', alpha=0.7)

ax.text(s="Kagglers All Around the Globe",x=0,y=100, fontsize=50,va='bottom',ha='center',color='#189AB4')
ax.set_axis_off()

sm = plt.cm.ScalarMappable(cmap='cividis', norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm._A = []

gc.collect()
cbar = fig.colorbar(sm, orientation='vertical', shrink= .25)
Requirement already satisfied: geopandas in /usr/local/lib/python3.7/dist-packages (0.10.2)

You can see that the higher the count, the closer the color gets to yellow.

Indians clearly make up a large share of Kaggle users.

Here and there, some countries are left blank in white.
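The two hand-typed 61-item lists above (`lis_countries`, `lis_pop`) could be built from `df['Q3'].value_counts()` by renaming survey labels to the geopandas `name` values. A sketch, where `survey_counts` stands in for that Series and the rename map is illustrative, not exhaustive:

```python
import pandas as pd

# Survey label -> geopandas 'name' (illustrative subset)
survey_to_geo = {
    'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
    'Viet Nam': 'Vietnam',
    'Czech Republic': 'Czechia',
}

# Stand-in for df['Q3'].value_counts()
survey_counts = pd.Series({'India': 7434, 'Viet Nam': 277, 'Czech Republic': 63})

renamed = survey_counts.rename(index=survey_to_geo)
our_country_analysis = (renamed.rename_axis('Country')
                               .reset_index(name='KagglePopulation'))
```

Joining on the renamed index keeps the counts and country names in sync automatically.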

Question 4: Education

df['Q4'].value_counts()
Master’s degree                                                      10132
Bachelor’s degree                                                     9907
Doctoral degree                                                       2795
Some college/university study without earning a bachelor’s degree     1735
I prefer not to answer                                                 627
No formal education past high school                                   417
Professional doctorate                                                 360
Name: Q4, dtype: int64
fig, ax = plt.subplots(figsize=(25,10), facecolor="w")

# Method for image
def make_img(img,zoom, x, y):
    img = mpimg.imread(img)
    imagebox = OffsetImage(img, zoom=zoom)
    ab = AnnotationBbox(imagebox, (x,y),frameon=False)
    ax.add_artist(ab)

img_file = "https://www.freeiconspng.com/thumbs/crown-icon/queen-crown-icon-4.png"
zoom = 1
img_y= 4.8



# I wanted to have the highest value in the middle, so I wrote the following two code lines
edu_bucket = ['Professional Doctorate','High School','Bachelor’s degree','Master’s degree','Doctoral degree','Others','No Answer']
edu_bucket_cnt = [360,417,9907,10132,2795,1735,627]

color = ['#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6'] # Deciding the color
width = [0.8, 0.8, 0.9, 0.9, 0.9, 0.8, 0.8] # The width
alpha = [0.5, 0.6, 0.75, 1.0, 0.75, 0.6, 0.5] # The opacity

fontsize = [12, 16, 18, 21, 16, 16, 16]
x_num = [0,1,2,3,4,5,6]

for i in range(7):
    plt.bar(x=edu_bucket[i], height=edu_bucket_cnt[i], width=width[i], color=color[i], alpha=alpha[i])
    plt.text(s=edu_bucket[i], x=x_num[i], y=edu_bucket_cnt[i], va='bottom', ha='center', fontsize=fontsize[i], alpha=alpha[i])

# Title, drawn once outside the loop
plt.text(s="Educational Qualifications of all Kagglers", x=3, y=11000, fontsize=50, va='bottom', ha='center', color='#189AB4')

# Placing the image
make_img(img_file,0.25, 3, 9500)      
    
gc.collect() # For Memory Optimization

plt.axis('off')
plt.show()

Most Kaggle users hold at least a bachelor's degree (the top three bars are master's, bachelor's, and doctoral degrees).

Question 5: Occupation

df['Q5'].value_counts()
Student                         6804
Data Scientist                  3616
Software Engineer               2449
Other                           2393
Data Analyst                    2301
Currently not employed          1986
Research Scientist              1538
Machine Learning Engineer       1499
Business Analyst                 968
Program/Project Manager          849
Data Engineer                    668
Product Manager                  319
Statistician                     313
DBA/Database Engineer            171
Developer Relations/Advocacy      99
Name: Q5, dtype: int64
# Method for image
def make_img(img,zoom, x, y):
    img = mpimg.imread(img)
    imagebox = OffsetImage(img, zoom=zoom)
    ab = AnnotationBbox(imagebox, (x,y),frameon=False)
    ax.add_artist(ab)

img_file = "https://www.freeiconspng.com/thumbs/crown-icon/queen-crown-icon-4.png"
zoom = 1
img_y= 4.8

fig, ax = plt.subplots(figsize=(25,10), facecolor="w")

# Creating a DataFrame to get the values and their counts (this was for my purpose)
# new_df = pd.DataFrame(df['Q1'].value_counts())

# I wanted to have the highest value in the middle, so I wrote the following two code lines
role_bucket = ['Developer\n Relations\n/Advocacy','Statistician','Data\n Engineer','Business\n Analyst','Research\n Scientist','Data\n Analyst','Software\n Engineer','Student',
               'Data\n Scientist','Other','Unemployed','ML\n Engineer','Project\n Manager','Product\n Manager','DB\n Engineer']
role_bucket_cnt = [99,313,668,968,1538,2301,2449,6804,3616,2393,1986,1499,849,319,171]  # 3616 for Data Scientist, matching the value_counts() output above

color = ['#E6E6E6', '#189AB4', '#E6E6E6', '#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4', '#E6E6E6'] # Deciding the color
width = [0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8] # The width
alpha = [0.3, 0.45, 0.3, 0.45, 0.5, 0.6, 0.75, 1.0, 0.75, 0.6, 0.5, 0.45, 0.3, 0.3, 0.45] # The opacity

fontsize = [12, 12, 14, 14, 14, 14, 18, 20, 16, 14, 12, 14, 14, 12, 12]
x_num = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]

for i in range(15):
    plt.bar(x=role_bucket[i], height=role_bucket_cnt[i], width=width[i], color=color[i], alpha=alpha[i])
    plt.text(s=role_bucket[i], x=x_num[i], y=role_bucket_cnt[i], va='bottom', ha='center', fontsize=fontsize[i], alpha=alpha[i])

# Title, drawn once outside the loop
plt.text(s="Current Role of all Kagglers", x=7.5, y=7500, fontsize=50, va='bottom', ha='center', color='#189AB4')

# Placing the image
make_img(img_file,0.15, 7, 6500)        
    
gc.collect() # For Memory Optimization

plt.axis('off')
plt.show()

Contrary to the author's expectation, students came out overwhelmingly on top.

(It seems most people expected ML specialists or data analysts to rank first.)

One thing worth noting here is "Other", which ranks fairly high.

This suggests that people from other fields actively use Kaggle, and shows that data analysis can be applied in many domains.

Question 6: Programming Experience

df['Q6'].value_counts()
1-3 years                    7874
< 1 years                    5881
3-5 years                    4061
5-10 years                   3099
10-20 years                  2166
20+ years                    1860
I have never written code    1032
Name: Q6, dtype: int64
years_bin = ['1-3years','<1years','3-5years','5-10years','10-20years','20+years','Never Coded']
years_cnt = [7874, 5881, 4061, 3099, 2166, 1860, 1032]

fig = plt.figure(figsize=(20,10))
plt.barh(width=years_cnt, y=years_bin, height=0.7, color = ['#189AB4', '#189AB4','#189AB4','#E6E6E6','#E6E6E6', '#E6E6E6', '#E6E6E6'], alpha=0.8)

##################### For the Years of Experience ###################################
s1 = ['1-3years','<1years','3-5years','5-10years','10-20years','20+years','Never Coded']
x1 = [8874, 6881, 5061, 4099, 3366, 2860, 2432]
y1 = [0,1,2,3,4,5,6]


for i in range(7):
    plt.text(s = s1[i], x=x1[i], y=y1[i] ,fontsize=25,va='center',ha='right',alpha=0.8)

plt.title("Average Years of Programming Experience of Kagglers", fontsize=42, pad=20, color='#189AB4')
plt.axis('off')
plt.gca().invert_yaxis()
plt.show()

Fewer Kagglers have long coding careers than you might expect.

You could read this as Kaggle being a young platform, or as one that is not hard for beginners to approach.
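A quick check of the newcomer share, using the counts printed by `value_counts()` above:

```python
# Counts copied from the value_counts() output above.
years_cnt = {
    '1-3 years': 7874, '< 1 years': 5881, '3-5 years': 4061,
    '5-10 years': 3099, '10-20 years': 2166, '20+ years': 1860,
    'I have never written code': 1032,
}

total = sum(years_cnt.values())
under_3 = (years_cnt['1-3 years'] + years_cnt['< 1 years']
           + years_cnt['I have never written code'])
print(f"{under_3 / total:.1%} of respondents have under 3 years of coding experience")
# → 56.9%
```

So well over half of respondents have been coding for less than three years.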

Question 7: Programming Languages

df['Q7_Part_1'].value_counts()
Python    21860
Name: Q7_Part_1, dtype: int64
df['Q7_Part_2'].value_counts()
R    5334
Name: Q7_Part_2, dtype: int64
Tool = ['Python', 'R']
  
# Slice sizes for the chart, from the counts above
Tool_cnt = [21860, 5334]
  
# colors
colors = ['#E6E6E6', '#189AB4']

# explosion
explode = (0.05, 0.05)
  

plt.figure(figsize=[20,10])   

# Pie Chart
plt.pie(Tool_cnt, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2,
        explode=explode,)
  
# draw circle
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()

plt.legend(Tool, loc = "upper right",title="Programming Languages", prop={'size': 15})
     
# Adding Circle in Pie chart
fig.gca().add_artist(centre_circle)
  
plt.rcParams['font.size'] = 25    
# Adding Title of chart
plt.text(s="Which Programming Tool do they Prefer?",x=0,y=1.3, fontsize=50,va='bottom',ha='center',color='#189AB4')
  
gc.collect()    
# Displaying the chart
plt.show()

There were other options besides Python and R, and the question allowed multiple selections, but the author compared only Python and R.

Python showed overwhelming usage, at over 80%.

Given the earlier findings that many respondents are students or experts from other fields, I suspect Python's usage is so high because it is an easy language.
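Q7 is a multi-select question stored as one column per language, which is why the notebook reads `Q7_Part_1` and `Q7_Part_2` separately. A hedged sketch of counting every part column at once (the `Q7_Part_*` naming is assumed from the pattern above):

```python
import pandas as pd

def multiselect_counts(df: pd.DataFrame, prefix: str) -> pd.Series:
    """Collapse a Kaggle-survey multi-select question into one Series.
    Each part column holds a single repeated label (or NaN), so the
    number of non-null cells is the number of respondents choosing it."""
    counts = {}
    for col in df.columns:
        if not col.startswith(prefix):
            continue
        non_null = df[col].dropna()
        if len(non_null):
            counts[non_null.iloc[0].strip()] = len(non_null)
    return pd.Series(counts).sort_values(ascending=False)

# Usage on the survey DataFrame:
# lang_counts = multiselect_counts(df, 'Q7_Part_')
```

That would compare Python and R against every other listed language in one pass.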

Question 8: Programming Languages, Part 2

df['Q8'] = df['Q8'].apply(lambda x: 'Others' if x not in ['Python','R','SQL'] else x)
df['Q8'].value_counts()
Python    20213
Others     2977
R          1445
SQL        1338
Name: Q8, dtype: int64
Tool = ['Python', 'R', 'SQL', 'Others']
  
# Slice sizes for the chart, from the counts above
Tool_cnt = [20213, 1445, 1338, 2977]
  
# colors
colors = ['#E6E6E6', '#189AB4', '#FFFF00', '#ADFF2F']

# explosion
explode = (0.05, 0.05, 0.05, 0.05)
  

plt.figure(figsize=[20,10])   

# Pie Chart
plt.pie(Tool_cnt, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2,
        explode=explode,)
  
# draw circle
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()

plt.legend(Tool, loc = "upper right",title="Programming Languages", prop={'size': 15})
     
# Adding Circle in Pie chart
fig.gca().add_artist(centre_circle)
  
plt.rcParams['font.size'] = 25    
# Adding Title of chart
plt.text(s="What do they Recommend for Data Science?",x=0,y=1.3, fontsize=50,va='bottom',ha='center',color='#189AB4')
  
gc.collect()    
# Displaying the chart
plt.show()

Similar to the previous question, except that multiple selections were not allowed here.

Even with quite a few options, Python was the overwhelming choice.

Question 9: Programming Environment (IDE)

df['Q9_Part_1'].value_counts()
Jupyter (JupyterLab, Jupyter Notebooks, etc)     5488
Name: Q9_Part_1, dtype: int64
df['Q9_Part_2'].value_counts()
 RStudio     4771
Name: Q9_Part_2, dtype: int64

It seems the values were extracted this way, column by column, and then applied below.
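That per-column extraction can be scripted rather than copied by hand. A minimal sketch (the `Q9_Part_1`..`Q9_Part_13` column names are an assumption based on the pattern above):

```python
import pandas as pd

def extract_part(df: pd.DataFrame, col: str):
    """Return (label, count) for a single multi-select part column:
    its value_counts() has exactly one row."""
    vc = df[col].value_counts()
    return vc.index[0].strip(), int(vc.iloc[0])

# Hypothetical usage, assuming 13 part columns:
# pairs = [extract_part(df, f'Q9_Part_{i}') for i in range(1, 14)]
# name, value = map(list, zip(*pairs))
```

The `strip()` matters because some survey labels carry leading spaces (e.g. ` RStudio ` in the output above).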

name = ['JupyterLab','RStudio','Visual Studio','VS Code','PyCharm','Spyder','Notepad++','Sublime Text','Vim/Emacs','MATLAB','Jupyter Notebook','None','Other']
value = [5488,4771,4110,10040,7468,3794,3937,2839,1646,2203,16233,526,1491]

# Creating a dataframe to store this information
df_nine_ = pd.DataFrame(name, columns=['IDE'])
df_nine_['Values'] = value
df_nine_ = df_nine_.sort_values(by="Values", ascending=False)
df_nine_

fig = plt.figure(figsize=(20,10))
# Use the columns directly: .unique() would silently drop a duplicated count and misalign bars and labels
plt.barh(width=list(df_nine_['Values']), y=list(df_nine_['IDE']), height=0.7, color = ['#189AB4', '#189AB4', '#189AB4', '#E6E6E6','#E6E6E6','#E6E6E6','#E6E6E6', '#E6E6E6', '#E6E6E6', '#E6E6E6', '#E6E6E6', '#E6E6E6', '#E6E6E6'], alpha=0.8)

##################### Labels for the IDE bars ###################################
s1 = list(df_nine_['IDE'])
x1 = [19833,12040,9468,7788,6471,6810,6437,5294,5539,4003,3946,2691,1726]
y1 = [0,1,2,3,4,5,6,7,8,9,10,11,12]


for i in range(13):
    plt.text(s = s1[i], x=x1[i], y=y1[i] , fontsize=25,va='center',ha='right',alpha=0.8)


plt.title("Preferred IDE of Kagglers", fontsize=42, pad=20, color='#189AB4')
plt.axis('off')
plt.gca().invert_yaxis()
gc.collect()
plt.show()

The author comments that Jupyter Notebook is user-friendly, citing how Shift+Enter conveniently shows results immediately.

I have also used VS Code for other languages (C); its GitHub integration makes it convenient to use, so unsurprisingly it has many users.

PyCharm also ranks high, though I have not tried it myself.

Relative to the proportion of R users, RStudio sees heavy use, so it seems most R users work in RStudio.

Question 10: Primary Hosted Notebook

df['Q10_Part_1'].value_counts()
 Kaggle Notebooks    9507
Name: Q10_Part_1, dtype: int64
df['Q10_Part_2'].value_counts()
Colab Notebooks    9792
Name: Q10_Part_2, dtype: int64

Anyone who is not a Colab or Kaggle notebook user seems to have been grouped under "None".
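If the 7,174 "None" figure below was derived rather than read from a dedicated column, one plausible derivation (an assumption, not confirmed by the notebook) is counting respondents who left both notebook columns empty:

```python
import pandas as pd

def count_neither(df: pd.DataFrame, cols) -> int:
    """Number of respondents with no selection in any of the given
    multi-select part columns."""
    return int(df[cols].isna().all(axis=1).sum())

# Hypothetical usage on the survey DataFrame:
# count_neither(df, ['Q10_Part_1', 'Q10_Part_2'])
```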

def make_img(img, zoom, x, y):
    """Place an image at data coordinates (x, y) on the current axes.
    Note: relies on a global `ax` created by plt.subplots()."""
    img = mpimg.imread(img)
    imagebox = OffsetImage(img, zoom=zoom)
    ab = AnnotationBbox(imagebox, (x, y), frameon=False)
    ax.add_artist(ab)

img_file = "https://www.freeiconspng.com/thumbs/crown-icon/queen-crown-icon-4.png"
zoom = 1
img_y= 4.8


# Visualizing the Hosted Notebooks. (Hidden Input)

fig, ax = plt.subplots(figsize=(25,10), facecolor="w")


age_bucket = ['None','Colab Notebook','Kaggle Notebook']   
age_bucket_cnt = [7174,9792,9507]  

color = ['#E6E6E6','#189AB4','#E6E6E6'] # Deciding the color
width = [0.9, 0.9, 0.9] # The Width
alpha = [0.55, 1.0, 0.75] # The Opacity

fontsize= [25, 45, 30]
x_num = [0,1,2]

for i in range(3):
    plt.bar(x=age_bucket[i],height=age_bucket_cnt[i], width=width[i], color=color[i], alpha=alpha[i])
    plt.text(s=age_bucket[i],x=x_num[i],y=age_bucket_cnt[i],va='bottom',ha='center',fontsize=fontsize[i], alpha=alpha[i])

# Chart title (drawn once, outside the loop)
plt.text(s="Preferred Hosted Notebooks",x=1,y=11000, fontsize=50,va='bottom',ha='center',color='#189AB4')

# Placing the image
make_img(img_file,0.3, 1, 9000)    
    
gc.collect() # For Memory Optimization

plt.axis('off')
plt.show()

Colab and Kaggle notebooks have similar user counts.

Colab holds the top share; its partial GPU access and tight Google Drive integration are presented as its big advantages.

Of course, since this surveys Kaggle users, the good compatibility and accessibility between Kaggle data and Kaggle notebooks likely inflated the Kaggle notebook count somewhat.

Still, Kaggle notebooks must have distinct advantages of their own; it would be worth exploring that environment when the chance comes up.

Another notable point is that quite a few people use their own local PC environments instead of either notebook.

Question 11: Accelerators

df['Q12_Part_1'].value_counts()
 NVIDIA GPUs     8036
Name: Q12_Part_1, dtype: int64
df['Q12_Part_2'].value_counts()
 Google Cloud TPUs     3451
Name: Q12_Part_2, dtype: int64
df['Q12_Part_3'].value_counts()
 AWS Trainium Chips     414
Name: Q12_Part_3, dtype: int64
df['Q12_Part_4'].value_counts()
 AWS Inferentia Chips     416
Name: Q12_Part_4, dtype: int64
df['Q12_Part_5'].value_counts()
None    13234
Name: Q12_Part_5, dtype: int64
df['Q12_OTHER'].value_counts()
Other    867
Name: Q12_OTHER, dtype: int64
name = ["None","NVIDIA GPUs","Google Cloud TPUs","Other","AWS Inferentia Chips","AWS Trainium Chips"]
count = [13234,8036,3451,867,416,414]

# Visualizing using a barh:
fig = plt.figure(figsize=(20,10))
plt.barh(width=count, y=name, height=0.7, color = ['#E6E6E6', '#189AB4', '#189AB4', '#E6E6E6','#E6E6E6','#E6E6E6'], alpha=0.8)

##################### Labels for the hardware bars ###################################
s1 = name
x1 = [14234,10236,6651,2067,3916,3714]
y1 = [0,1,2,3,4,5]


for i in range(6):
    plt.text(s = s1[i], x=x1[i], y=y1[i] , fontsize=25,va='center',ha='right',alpha=0.8)

plt.title("Specialized Hardware", fontsize=42, pad=20, color='#189AB4')
plt.axis('off')
plt.gca().invert_yaxis()
gc.collect()
plt.show()

Quite a lot of Kaggle users turn out not to use a GPU or TPU.
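The multi-select counts above can at least bound that share: hardware categories may overlap per respondent, so summing them overcounts hardware users, which makes the "None" share computed below a lower bound.

```python
# Counts copied from the value_counts() output above.
counts = {
    'None': 13234, 'NVIDIA GPUs': 8036, 'Google Cloud TPUs': 3451,
    'Other': 867, 'AWS Inferentia Chips': 416, 'AWS Trainium Chips': 414,
}

hardware_total = sum(v for k, v in counts.items() if k != 'None')
# Overlap can only shrink the true number of hardware users,
# so this is a lower bound on the no-accelerator fraction.
share = counts['None'] / (counts['None'] + hardware_total)
print(f"At least {share:.0%} of respondents report no specialized hardware")
# → At least 50%
```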

Takeaways

This was a study in visualizing a survey, rather than studying data for a competition entry.

I could see the author's varied efforts to make the visualizations look good.

Also, because the survey is about Kaggle users, I found the results especially interesting to examine.

It seems like a good dataset for light practice.

Competition source: https://www.kaggle.com/c/kaggle-survey-2021

Code source: https://www.kaggle.com/vivek468/what-s-up-kaggle-kaggle-survey-2021