Connecting to Kaggle

!pip install kaggle
!pip install --upgrade --force-reinstall --no-deps kaggle
from google.colab import files
files.upload()


Saving kaggle.json to kaggle.json

{'kaggle.json': b'{"username":"ksy1998","key":"<redacted>"}'}

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle competitions download -c kaggle-survey-2021

Downloading kaggle-survey-2021.zip to /content
  0% 0.00/3.01M [00:00<?, ?B/s]
100% 3.01M/3.01M [00:00<00:00, 103MB/s]

!unzip kaggle-survey-2021.zip

Archive:  kaggle-survey-2021.zip
  inflating: kaggle_survey_2021_responses.csv  
  inflating: supplementary_data/kaggle_survey_2021_answer_choices.pdf  
  inflating: supplementary_data/kaggle_survey_2021_methodology.pdf

Loading the data

import gc # For Memory Optimization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # Not sure if I used this
from wordcloud import WordCloud
from scipy.stats import norm

# Some more necessary libraries (These are for drawing the image on the bar charts)
import matplotlib.font_manager as fm
from matplotlib.offsetbox import TextArea, DrawingArea, OffsetImage, AnnotationBbox
import matplotlib.image as mpimg

# To Avoid unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

# Since there are many columns, I would like to view them all
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 400)

df = pd.read_csv('kaggle_survey_2021_responses.csv')
df = df.iloc[1:,:] # The first row was describing the columns. Better to look at the description from the Metadata file provided
df.head(3).style.set_properties(**{"background-color": "#76c5d6","color": "black", "border-color": "black"})

print('Number of rows:', df.shape[0])
print('Number of columns:', df.shape[1])

Number of rows: 25973
Number of columns: 369
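The `df.iloc[1:,:]` step above drops the first data row because it holds the question text rather than a response. A minimal sketch of the same idea using only the standard library; the four-line CSV excerpt here is a made-up stand-in for the real file, not a copy of it:

```python
import csv
import io

# Hypothetical excerpt shaped like the survey file:
# a header row, a question-text row, then the real responses.
raw = io.StringIO(
    "Q1,Q2\n"
    "What is your age (# years)?,What is your gender?\n"
    "25-29,Man\n"
    "18-21,Woman\n"
)

rows = list(csv.DictReader(raw))

question_text = rows[0]  # the descriptive row, kept aside for reference
responses = rows[1:]     # same effect as df.iloc[1:, :] in the post

print(len(responses))
print(question_text["Q1"])
```

In pandas, passing `skiprows=[1]` to `pd.read_csv` should give the same result without loading the description row at all.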

Question 1: Age

df['Q1'].value_counts()

25-29    4931
18-21    4901
22-24    4694
30-34    3441
35-39    2504
40-44    1890
45-49    1375
50-54     964
55-59     592
60-69     553
70+       128
Name: Q1, dtype: int64

fig, ax = plt.subplots(figsize=(25,10), facecolor="w")

# Method for image
def make_img(img,zoom, x, y):
    img = mpimg.imread(img)
    imagebox = OffsetImage(img, zoom=zoom)
    ab = AnnotationBbox(imagebox, (x,y),frameon=False)
    ax.add_artist(ab)

img_file = "https://www.freeiconspng.com/thumbs/crown-icon/queen-crown-icon-4.png"
zoom = 1
img_y= 4.8

# Creating a DataFrame to get the values and their counts (this was for my purpose)
# new_df = pd.DataFrame(df['Q1'].value_counts())

# I wanted to have the highest value in the middle, so i wrote the following two code lines
age_bucket = ['70+','55-59','45-49','35-39','22-24','25-29','18-21','30-34','40-44','50-54','60-69']   #new_df.index
age_bucket_cnt = [128,592,1375,2504,4694,4931,4901,3441,1890,964,553]   #list(new_df.Q1.values)

color = ['#E6E6E6', '#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6'] # Deciding the color
width = [0.8, 0.8, 0.8, 0.8, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.8] # The Width
alpha = [0.3, 0.45, 0.5, 0.6, 0.75, 1.0, 0.75, 0.6, 0.5, 0.45, 0.3] # The Opacity

fontsize= [20, 20, 20, 20, 25, 35, 30, 20, 20, 20, 20]
x_num = [0,1,2,3,4,5,6,7,8,9,10]

for i in range(11):
    plt.bar(x=age_bucket[i],height=age_bucket_cnt[i], width=width[i], color=color[i], alpha=alpha[i])
    plt.text(s=age_bucket[i],x=x_num[i],y=age_bucket_cnt[i],va='bottom',ha='center',fontsize=fontsize[i], alpha=alpha[i])

# Draw the title once, outside the loop, instead of redrawing it on every iteration
plt.text(s="Age Bucket of all Kagglers",x=5,y=5500, fontsize=50,va='bottom',ha='center',color='#189AB4')

# Placing the image
make_img(img_file,0.2, 5, 4700)    
    
gc.collect() # For Memory Optimization

plt.axis('off')
plt.show()

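Kaggle clearly seems to be used mostly by university students and job seekers: the three buckets from 18 to 29 account for about 56% of respondents.

What surprised me, though, was that the 18-21 bucket is higher than I expected.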
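The chart code hard-codes the bar order so that the tallest bar sits in the middle. One way to derive that order from the counts is sketched below; `center_peak_order` is a hypothetical helper, not something from the original notebook. It ranks the buckets by count and places them alternately to the right and left of the largest one.

```python
from collections import deque

def center_peak_order(counts):
    """Return (label, count) pairs arranged so the largest sits in the middle."""
    ordered = deque()
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for rank, pair in enumerate(ranked):
        if rank % 2 == 1:
            ordered.append(pair)      # 2nd, 4th, ... largest go to the right
        else:
            ordered.appendleft(pair)  # 1st, 3rd, ... largest go to the left
    return list(ordered)

# Counts copied from the df['Q1'].value_counts() output in the post
q1_counts = {
    "25-29": 4931, "18-21": 4901, "22-24": 4694, "30-34": 3441,
    "35-39": 2504, "40-44": 1890, "45-49": 1375, "50-54": 964,
    "55-59": 592, "60-69": 553, "70+": 128,
}

arranged = center_peak_order(q1_counts)
age_bucket = [label for label, _ in arranged]
age_bucket_cnt = [count for _, count in arranged]
print(age_bucket)
```

For the Q1 counts this yields the same `age_bucket` and `age_bucket_cnt` lists that the post types out by hand, which avoids transcription mistakes when the same chart style is reused for other questions.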

Question 2: Gender

df['Q2'].value_counts()

Man                        20598
Woman                       4890
Prefer not to say            355
Nonbinary                     88
Prefer to self-describe       42
Name: Q2, dtype: int64

Gender = ['Man', 'Woman', 'Others']
  
# Setting size in Chart based on 
# given values
Gender_cnt = [20598, 4890, 485]
  
# colors
colors = ['#E6E6E6', '#189AB4', '#FFFF00', 
          '#ADFF2F', '#FFA500']
# explosion
explode = (0.05, 0.05, 0.2)
  
    
plt.figure(figsize=[20,10])    
# Pie Chart
plt.pie(Gender_cnt, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2,
        explode=explode,)
  
# draw circle
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()

plt.legend(Gender, loc = "upper right",title="Genders", prop={'size': 15})
     
# Adding Circle in Pie chart
fig.gca().add_artist(centre_circle)
  
plt.rcParams['font.size'] = 25    
# Adding Title of chart
plt.text(s="Gender Diversity in Kaggle",x=0,y=1.3, fontsize=50,va='bottom',ha='center',color='#189AB4')
  
gc.collect()    
# Displaying Chart
plt.show()

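Men make up about 79% and women about 19%; the remaining answers (prefer not to say, nonbinary, self-described) add up to about 2%.

This clearly looks like a male-dominated field.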
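The "Others" slice and the percentages can be checked directly from the `value_counts()` output. A small sketch, assuming the three-way grouping used in the pie chart (Man, Woman, everything else):

```python
# Counts copied from the df['Q2'].value_counts() output in the post
q2_counts = {
    "Man": 20598,
    "Woman": 4890,
    "Prefer not to say": 355,
    "Nonbinary": 88,
    "Prefer to self-describe": 42,
}

keep = ("Man", "Woman")

# Keep the two largest categories and fold the rest into "Others"
grouped = {label: q2_counts[label] for label in keep}
grouped["Others"] = sum(n for label, n in q2_counts.items() if label not in keep)

total = sum(grouped.values())
shares = {label: round(100 * n / total, 1) for label, n in grouped.items()}

print(grouped)
print(shares)
```

The grouped total equals the 25,973 rows reported earlier, so every respondent falls into exactly one of the three slices.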

Question 3: Country

df['Q3'].value_counts()

India                                                   7434
United States of America                                2650
Other                                                   1270
Japan                                                    921
China                                                    814
Brazil                                                   751
Russia                                                   742
Nigeria                                                  702
United Kingdom of Great Britain and Northern Ireland     550
Pakistan                                                 530
Egypt                                                    482
Germany                                                  470
Spain                                                    454
Indonesia                                                444
Turkey                                                   416
France                                                   401
South Korea                                              359
Taiwan                                                   334
Canada                                                   331
Bangladesh                                               317
Italy                                                    311
Mexico                                                   279
Viet Nam                                                 277
Australia                                                264
Kenya                                                    248
Colombia                                                 225
Poland                                                   219
Iran, Islamic Republic of...                             195
Ukraine                                                  186
Singapore                                                182
Argentina                                                182
Malaysia                                                 156
Netherlands                                              153
South Africa                                             146
Morocco                                                  140
Israel                                                   138
Thailand                                                 123
Portugal                                                 119
Peru                                                     117
United Arab Emirates                                     111
Tunisia                                                  109
Philippines                                              108
Sri Lanka                                                106
Chile                                                    102
Greece                                                   102
Ghana                                                     99
Saudi Arabia                                              89
Ireland                                                   84
Sweden                                                    81
Hong Kong (S.A.R.)                                        79
Nepal                                                     75
Switzerland                                               71
I do not wish to disclose my location                     69
Belgium                                                   65
Czech Republic                                            63
Romania                                                   61
Austria                                                   51
Belarus                                                   51
Ecuador                                                   50
Denmark                                                   48
Uganda                                                    47
Norway                                                    45
Kazakhstan                                                45
Algeria                                                   44
Ethiopia                                                  43
Iraq                                                      43
Name: Q3, dtype: int64

!pip install geopandas
import geopandas as gpd

# List of countries we are interested in
lis_countries = ["Algeria","Argentina","Australia","Austria","Bangladesh","Belarus","Belgium","Brazil","Canada","Chile","China","Colombia",
                 "Czechia","Denmark","Ecuador","Egypt","Ethiopia","France","Germany","Ghana","Greece","India","Indonesia","Iraq","Ireland",
                 "Israel","Italy","Japan","Kazakhstan","Kenya","Malaysia","Mexico","Morocco","Nepal","Netherlands","Nigeria","Norway","Pakistan",
                 "Peru","Philippines","Poland","Portugal","Romania","Russia","Saudi Arabia","South Africa","South Korea","Spain","Sri Lanka",
                 "Sweden","Switzerland","Taiwan","Thailand","Tunisia","Turkey","Uganda","Ukraine","United Arab Emirates","United Kingdom",
                 "United States of America","Vietnam"]

# Reading the geopandas data 
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

country_data = lis_countries # Passing the list of countries here
country_geo = list(world['name']) # The country list from the geopandas dataset

# List of all the values of population of Kagglers from each country
lis_pop = [44,182,264,51,317,51,65,751,331,102,814,225,63,48,50,482,43,401,470,99,102,7434,444,43,84,138,311,921,45,248,156,279,140,75,153,
           702,45,530,117,108,219,119,61,742,89,146,359,454,106,81,71,334,123,109,416,47,186,111,550,2650,277]

# Next we need to create a dataframe with lis_countries and lis_pop
our_country_analysis = pd.DataFrame(lis_countries, columns=['Country'])
our_country_analysis['KagglePopulation'] = lis_pop

# Next, we are going to visualize this...
mapped = world.set_index('name').join(our_country_analysis.set_index('Country')).reset_index()

to_be_mapped = 'KagglePopulation'
vmin, vmax = 0,10000
fig, ax = plt.subplots(1, figsize=(25,30))

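# Pass vmin/vmax so the map colors use the same scale as the colorbar created below
mapped.dropna().plot(column=to_be_mapped, cmap='cividis', linewidth=0.8, ax=ax, edgecolors='1', alpha=0.7, vmin=vmin, vmax=vmax)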

ax.text(s="Kagglers All Around the Globe",x=0,y=100, fontsize=50,va='bottom',ha='center',color='#189AB4')
ax.set_axis_off()

sm = plt.cm.ScalarMappable(cmap='cividis', norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm._A = []

gc.collect()
cbar = fig.colorbar(sm, orientation='vertical', shrink= .25)


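On the map, countries with more respondents are shaded closer to yellow.

India clearly has by far the most users.

Some countries are left blank (white) here and there. These are the countries missing from lis_countries, which dropna() removes before plotting.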
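Several survey country names differ from the names in the hand-written `lis_countries` list (for example "Viet Nam" versus "Vietnam"), so the counts cannot be joined to the map by name without a rename step. The sketch below shows one way to reconcile them and to list what stays unmatched. The rename table is inferred by comparing the Q3 output with `lis_countries` in the post, and `map_names` is a tiny illustrative stand-in for `list(world['name'])`, not the real map dataset.

```python
# A few counts copied from the df['Q3'].value_counts() output in the post
survey_counts = {
    "India": 7434,
    "Viet Nam": 277,
    "Czech Republic": 63,
    "United Kingdom of Great Britain and Northern Ireland": 550,
    "Other": 1270,
}

# Renames inferred from the post's lis_countries (an assumption, not an official table)
survey_to_map_name = {
    "Viet Nam": "Vietnam",
    "Czech Republic": "Czechia",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
}

# Illustrative stand-in for list(world['name'])
map_names = {"India", "Vietnam", "Czechia", "United Kingdom", "Greenland"}

renamed = {survey_to_map_name.get(name, name): n for name, n in survey_counts.items()}

unmatched_survey = sorted(set(renamed) - map_names)  # survey answers with no polygon
blank_on_map = sorted(map_names - set(renamed))      # polygons with no survey count

print(unmatched_survey)
print(blank_on_map)
```

Printing both leftover lists makes it easy to see which survey answers were dropped and which countries will appear blank.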

Question 4: Education

df['Q4'].value_counts()

Master’s degree                                                      10132
Bachelor’s degree                                                     9907
Doctoral degree                                                       2795
Some college/university study without earning a bachelor’s degree     1735
I prefer not to answer                                                 627
No formal education past high school                                   417
Professional doctorate                                                 360
Name: Q4, dtype: int64

fig, ax = plt.subplots(figsize=(25,10), facecolor="w")

# Method for image
def make_img(img,zoom, x, y):
    img = mpimg.imread(img)
    imagebox = OffsetImage(img, zoom=zoom)
    ab = AnnotationBbox(imagebox, (x,y),frameon=False)
    ax.add_artist(ab)

img_file = "https://www.freeiconspng.com/thumbs/crown-icon/queen-crown-icon-4.png"
zoom = 1
img_y= 4.8



# I wanted to have the highest value in the middle, so i wrote the following two code lines
age_bucket = ['Professional Doctorate','High School','Bachelor’s degree','Master’s degree','Doctoral degree','Others','No Answer']  
age_bucket_cnt = [360,417,9907,10132,2795,1735,627]  

color = ['#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6'] # Deciding the color
width = [0.8, 0.8, 0.9, 0.9, 0.9, 0.8, 0.8,] # The Width
alpha = [0.5, 0.6, 0.75, 1.0, 0.75, 0.6, 0.5] # The Opacity

fontsize= [12, 16, 18, 21, 16, 16, 16]
x_num = [0,1,2,3,4,5,6]

for i in range(7):
    plt.bar(x=age_bucket[i],height=age_bucket_cnt[i], width=width[i], color=color[i], alpha=alpha[i])
    plt.text(s=age_bucket[i],x=x_num[i],y=age_bucket_cnt[i],va='bottom',ha='center',fontsize=fontsize[i], alpha=alpha[i])

# Draw the title once, outside the loop, instead of redrawing it on every iteration
plt.text(s="Educational Qualifications of all Kagglers",x=3,y=11000, fontsize=50,va='bottom',ha='center',color='#189AB4')

# Placing the image
make_img(img_file,0.25, 3, 9500)      
    
gc.collect() # For Memory Optimization

plt.axis('off')
plt.show()

Most Kaggle users hold a bachelor's degree or higher: master's, bachelor's, and doctoral degrees together account for about 88% of respondents.

Question 5: Current role

df['Q5'].value_counts()

Student                         6804
Data Scientist                  3616
Software Engineer               2449
Other                           2393
Data Analyst                    2301
Currently not employed          1986
Research Scientist              1538
Machine Learning Engineer       1499
Business Analyst                 968
Program/Project Manager          849
Data Engineer                    668
Product Manager                  319
Statistician                     313
DBA/Database Engineer            171
Developer Relations/Advocacy      99
Name: Q5, dtype: int64

# Method for image
def make_img(img,zoom, x, y):
    img = mpimg.imread(img)
    imagebox = OffsetImage(img, zoom=zoom)
    ab = AnnotationBbox(imagebox, (x,y),frameon=False)
    ax.add_artist(ab)

img_file = "https://www.freeiconspng.com/thumbs/crown-icon/queen-crown-icon-4.png"
zoom = 1
img_y= 4.8

fig, ax = plt.subplots(figsize=(25,10), facecolor="w")

# Creating a DataFrame to get the values and their counts (this was for my purpose)
# new_df = pd.DataFrame(df['Q1'].value_counts())

# I wanted to have the highest value in the middle, so i wrote the following two code lines
age_bucket = ['Developer\n Relations\n/Advocacy','Statistician','Data\n Engineer','Business\n Analyst','Research\n Scientist','Data\n Analyst','Software\n Engineer','Student',
              'Data\n Scientist','Other','Unemployed','ML\n Engineer','Project\n Manager','Product\n Manager','DB\n Engineer']   #new_df.index
age_bucket_cnt = [99,313,668,968,1538,2301,2449,6804,3616,2393,1986,1499,849,319,171]   # Data Scientist corrected to 3616 to match value_counts()

color = ['#E6E6E6', '#189AB4', '#E6E6E6', '#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4','#E6E6E6','#189AB4', '#E6E6E6'] # Deciding the color
width = [0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8] # The Width
alpha = [0.3, 0.45, 0.3, 0.45, 0.5, 0.6, 0.75, 1.0, 0.75, 0.6, 0.5, 0.45, 0.3, 0.3, 0.45] # The Opacity

fontsize= [12, 12, 14, 14, 14, 14, 18, 20, 16, 14, 12, 14, 14, 12, 12]
x_num = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]

for i in range(15):
    plt.bar(x=age_bucket[i],height=age_bucket_cnt[i], width=width[i], color=color[i], alpha=alpha[i])
    plt.text(s=age_bucket[i],x=x_num[i],y=age_bucket_cnt[i],va='bottom',ha='center',fontsize=fontsize[i], alpha=alpha[i])

# Draw the title once, outside the loop, instead of redrawing it on every iteration
plt.text(s="Current Role of all Kagglers",x=7.5,y=7500, fontsize=50,va='bottom',ha='center',color='#189AB4')

# Placing the image
make_img(img_file,0.15, 7, 6500)        
    
gc.collect() # For Memory Optimization

plt.axis('off')
plt.show()

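Contrary to what the notebook's author expected, students came out far ahead of every other role.

(The author seems to have expected mostly ML specialists or data analysts.)

The category worth noting here is Other, which ranks fairly high.

This could mean that people from other fields are active on Kaggle, which suggests that data analysis can be applied across many domains.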

Question 6: Programming experience

df['Q6'].value_counts()

1-3 years                    7874
< 1 years                    5881
3-5 years                    4061
5-10 years                   3099
10-20 years                  2166
20+ years                    1860
I have never written code    1032
Name: Q6, dtype: int64

years_bin = ['1-3years','<1years','3-5years','5-10years','10-20years','20+years','Never Coded']
years_cnt = [7874, 5881, 4061, 3099, 2166, 1860, 1032]

fig = plt.figure(figsize=(20,10))
plt.barh(width=years_cnt, y=years_bin, height=0.7, color = ['#189AB4', '#189AB4','#189AB4','#E6E6E6','#E6E6E6', '#E6E6E6', '#E6E6E6'], alpha=0.8)

##################### For the Years of Experience ###################################
s1 = ['1-3years','<1years','3-5years','5-10years','10-20years','20+years','Never Coded']
x1 = [8874, 6881, 5061, 4099, 3366, 2860, 2432]
y1 = [0,1,2,3,4,5,6]


for i in range(7):
    plt.text(s = s1[i], x=x1[i], y=y1[i] ,fontsize=25,va='center',ha='right',alpha=0.8)

plt.title("Average Years of Programming Experience of Kagglers", fontsize=42, pad=20, color='#189AB4')
plt.axis('off')
plt.gca().invert_yaxis()
plt.show()

There are fewer Kagglers with long coding experience than I expected.

You could read this as Kaggle being a young platform, or as a sign that it is not hard for beginners to get into.

Question 7: Programming languages

df['Q7_Part_1'].value_counts()

Python    21860
Name: Q7_Part_1, dtype: int64

df['Q7_Part_2'].value_counts()

R    5334
Name: Q7_Part_2, dtype: int64

Tool = ['Python', 'R']
  
# Setting size in Chart based on 
# given values
Tool_cnt = [21860, 5334]
  
# colors
colors = ['#E6E6E6', '#189AB4']

# explosion
explode = (0.05, 0.05)
  

plt.figure(figsize=[20,10])   

# Pie Chart
plt.pie(Tool_cnt, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2,
        explode=explode,)
  
# draw circle
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()

plt.legend(Tool, loc = "upper right",title="Programming Languages", prop={'size': 15})
     
# Adding Circle in Pie chart
fig.gca().add_artist(centre_circle)
  
plt.rcParams['font.size'] = 25    
# Adding Title of chart
plt.text(s="Which Programming Tool do they Prefer?",x=0,y=1.3, fontsize=50,va='bottom',ha='center',color='#189AB4')
  
gc.collect()    
# Displaying Chart
plt.show()

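The question offered other choices besides Python and R and allowed multiple selections, but the author compared only Python and R.

Python dominates, taking just over 80% of the combined Python and R selections in the chart.

Since the earlier results showed many students and many people from other fields, I suspect Python's high usage comes from it being an easy language to learn.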
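Because Q7 allows multiple selections, the pie chart's percentages are shares of the combined Python and R selections, not shares of respondents. A short sketch of both calculations, using only numbers that appear in the post:

```python
respondents = 25973  # row count printed earlier in the post

# Counts copied from the Q7_Part_1 and Q7_Part_2 outputs
selected = {"Python": 21860, "R": 5334}

# Share of all respondents who ticked each language (can sum to more than 100%)
share_of_respondents = {
    lang: round(100 * n / respondents, 1) for lang, n in selected.items()
}

# Share of the combined selections, which is what the pie chart displays
total_selections = sum(selected.values())
share_of_selections = {
    lang: round(100 * n / total_selections, 1) for lang, n in selected.items()
}

print(share_of_respondents)
print(share_of_selections)
```

Both views put Python above 80%, but only the first one answers "what fraction of Kagglers use Python"; for multi-select questions a bar chart of respondent shares is usually easier to read than a pie.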

Question 8: Programming language 2 (recommended language)

df['Q8'] = df['Q8'].apply(lambda x: 'Others' if x not in ['Python','R','SQL'] else x)
df['Q8'].value_counts()

Python    20213
Others     2977
R          1445
SQL        1338
Name: Q8, dtype: int64

Tool = ['Python', 'R', 'SQL', 'Others']
  
# Setting size in Chart based on 
# given values
Tool_cnt = [20213, 1445, 1338, 2977]
  
# colors
colors = ['#E6E6E6', '#189AB4', '#FFFF00', '#ADFF2F']

# explosion
explode = (0.05, 0.05, 0.05, 0.05)
  

plt.figure(figsize=[20,10])   

# Pie Chart
plt.pie(Tool_cnt, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2,
        explode=explode,)
  
# draw circle
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()

plt.legend(Tool, loc = "upper right",title="Programming Languages", prop={'size': 15})
     
# Adding Circle in Pie chart
fig.gca().add_artist(centre_circle)
  
plt.rcParams['font.size'] = 25    
# Adding Title of chart
plt.text(s="What do they Recommend for Data Science?",x=0,y=1.3, fontsize=50,va='bottom',ha='center',color='#189AB4')
  
gc.collect()    
# Displaying Chart
plt.show()

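This is similar to the previous question; the difference is that only one answer could be selected.

Even though there were quite a few choices, Python was chosen overwhelmingly.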
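One detail of the `apply(lambda ...)` grouping above: any value that is not Python, R, or SQL becomes "Others", and that includes missing answers. The four counts in the output add up to 25,973, the full row count, which suggests unanswered rows were folded into "Others". The sketch below contrasts the two behaviours on a made-up list; `None` stands in for pandas' `NaN`, and the "No answer" label is hypothetical.

```python
from collections import Counter

# Made-up answers; None marks a respondent who skipped the question
answers = ["Python", "R", None, "Julia", "SQL", "Python", None]
main = {"Python", "R", "SQL"}

def group_like_post(x):
    # Mirrors the lambda in the post: anything outside the main set is "Others"
    return x if x in main else "Others"

def group_keep_missing(x):
    # Variant that reports skipped answers separately
    if x is None:
        return "No answer"
    return x if x in main else "Others"

as_in_post = Counter(map(group_like_post, answers))
separated = Counter(map(group_keep_missing, answers))

print(as_in_post)
print(separated)
```

Keeping skipped answers in their own bucket, or dropping them before plotting, stops the "Others" slice from looking larger than the number of people who actually recommended another language.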

Question 9: Programming environment (IDE)

df['Q9_Part_1'].value_counts()

Jupyter (JupyterLab, Jupyter Notebooks, etc)     5488
Name: Q9_Part_1, dtype: int64

df['Q9_Part_2'].value_counts()

 RStudio     4771
Name: Q9_Part_2, dtype: int64

The values used below seem to have been extracted this way, one column at a time.
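Reading the counts off one column at a time and typing them into a list is error-prone. A sketch of collecting them in one pass is shown here, using plain dictionaries as a stand-in for survey rows; `count_selected` is a hypothetical helper and the three rows are invented for illustration.

```python
def count_selected(rows, prefix):
    """Count non-empty answers in every column whose name starts with prefix."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix) and value and value.strip():
                label = value.strip()  # some survey labels carry stray spaces
                counts[label] = counts.get(label, 0) + 1
    return counts

# Invented rows: an empty string means the option was not ticked
rows = [
    {"Q9_Part_1": "Jupyter (JupyterLab, Jupyter Notebooks, etc)",
     "Q9_Part_2": " RStudio ", "Q10_Part_1": " Kaggle Notebooks"},
    {"Q9_Part_1": "Jupyter (JupyterLab, Jupyter Notebooks, etc)",
     "Q9_Part_2": "", "Q10_Part_1": ""},
    {"Q9_Part_1": "", "Q9_Part_2": " RStudio ", "Q10_Part_1": ""},
]

q9_counts = count_selected(rows, "Q9_")
print(q9_counts)
```

With the real DataFrame, something like `df.filter(regex=r'^Q9_').count()` should give the per-column counts directly, since `count()` ignores missing values.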

name = ['JupyterLab','RStudio','Visual Studio','VS Code','PyCharm','Spyder','Notepad++','Sublime Text','Vim/Emacs','MATLAB','Jupyter Notebook','None','Other']
value = [5488,4771,4110,10040,7468,3794,3937,2839,1646,2203,16233,526,1491]

# Creating a dataframe to store this information
df_nine_ = pd.DataFrame(name, columns=['IDE'])
df_nine_['Values'] = value
df_nine_ = df_nine_.sort_values(by="Values", ascending=False)
df_nine_

fig = plt.figure(figsize=(20,10))
plt.barh(width=list(df_nine_['Values'].unique()), y=list(df_nine_['IDE'].unique()), height=0.7, color = ['#189AB4', '#189AB4', '#189AB4', '#E6E6E6','#E6E6E6','#E6E6E6','#E6E6E6', '#E6E6E6', '#E6E6E6', '#E6E6E6', '#E6E6E6', '#E6E6E6', '#E6E6E6'], alpha=0.8)

##################### Labels placed at the end of each bar ###################################
s1 = list(df_nine_['IDE'].unique())
x1 = [19833,12040,9468,7788,6471,6810,6437,5294,5539,4003,3946,2691,1726]
y1 = [0,1,2,3,4,5,6,7,8,9,10,11,12]


for i in range(13):
    plt.text(s = s1[i], x=x1[i], y=y1[i] , fontsize=25,va='center',ha='right',alpha=0.8)


plt.title("Preferred IDE of Kagglers", fontsize=42, pad=20, color='#189AB4')
plt.axis('off')
plt.gca().invert_yaxis()
gc.collect()
plt.show()

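The author comments that Jupyter Notebook is user-friendly, citing the convenience of seeing the output right away with Shift+Enter.

I have used VS Code myself for another language (C); its good GitHub integration makes it convenient. As expected, many users rely on it.

I have not tried PyCharm, but it also ranks high.

RStudio usage is high relative to the number of R users (4,771 RStudio users against 5,334 R users), so I think most R users work in RStudio.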

Question 10: Most-used hosted notebooks

df['Q10_Part_1'].value_counts()

 Kaggle Notebooks    9507
Name: Q10_Part_1, dtype: int64

df['Q10_Part_2'].value_counts()

Colab Notebooks    9792
Name: Q10_Part_2, dtype: int64

Everyone other than Colab and Kaggle notebook users seems to have been treated as a single remaining category (labeled None in the chart).

def make_img(img,zoom, x, y):
    img = mpimg.imread(img)
    imagebox = OffsetImage(img, zoom=zoom)
    ab = AnnotationBbox(imagebox, (x,y),frameon=False)
    ax.add_artist(ab)

img_file = "https://www.freeiconspng.com/thumbs/crown-icon/queen-crown-icon-4.png"
zoom = 1
img_y= 4.8


# Visualizing the Hosted Notebooks. (Hidden Input)

fig, ax = plt.subplots(figsize=(25,10), facecolor="w")


age_bucket = ['None','Colab Notebook','Kaggle Notebook']   
age_bucket_cnt = [7174,9792,9507]  

color = ['#E6E6E6','#189AB4','#E6E6E6'] # Deciding the color
width = [0.9, 0.9, 0.9] # The Width
alpha = [0.55, 1.0, 0.75] # The Opacity

fontsize= [25, 45, 30]
x_num = [0,1,2]

for i in range(3):
    plt.bar(x=age_bucket[i],height=age_bucket_cnt[i], width=width[i], color=color[i], alpha=alpha[i])
    plt.text(s=age_bucket[i],x=x_num[i],y=age_bucket_cnt[i],va='bottom',ha='center',fontsize=fontsize[i], alpha=alpha[i])
    plt.text(s="Preferred Hosted Notebooks",x=1,y=11000, fontsize=50,va='bottom',ha='center',color='#189AB4')

# Placing the image
make_img(img_file,0.3, 1, 9000)    
    
gc.collect() # For Memory Optimization

plt.axis('off')
plt.show()

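Colab and Kaggle notebooks have similar numbers of users.

Colab ranks first; the author highlights its partial GPU access and smooth Google Drive integration as major advantages.

Of course, since this is a survey of Kaggle users, Kaggle Notebooks probably score somewhat high because of their compatibility with, and easy access to, Kaggle datasets.

Still, Kaggle Notebooks must have clear advantages of their own. It would be worth exploring that environment when I get the chance.

Another notable point is that quite a few people use neither of the two notebooks and work in their own PC environment instead.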

Question 11: Accelerators (survey column Q12)

df['Q12_Part_1'].value_counts()

 NVIDIA GPUs     8036
Name: Q12_Part_1, dtype: int64

df['Q12_Part_2'].value_counts()

 Google Cloud TPUs     3451
Name: Q12_Part_2, dtype: int64

df['Q12_Part_3'].value_counts()

 AWS Trainium Chips     414
Name: Q12_Part_3, dtype: int64

df['Q12_Part_4'].value_counts()

 AWS Inferentia Chips     416
Name: Q12_Part_4, dtype: int64

df['Q12_Part_5'].value_counts()

None    13234
Name: Q12_Part_5, dtype: int64

df['Q12_OTHER'].value_counts()

Other    867
Name: Q12_OTHER, dtype: int64

name = ["None","NVIDIA GPUs","Google Cloud TPUs","Other","AWS Inferentia Chips","AWS Trainium Chips"]
count = [13234,8036,3451,867,416,414]

# Visualizing using a barh:
fig = plt.figure(figsize=(20,10))
plt.barh(width=count, y=name, height=0.7, color = ['#E6E6E6', '#189AB4', '#189AB4', '#E6E6E6','#E6E6E6','#E6E6E6'], alpha=0.8)

##################### Labels placed at the end of each bar ###################################
s1 = name
x1 = [14234,10236,6651,2067,3916,3714]
y1 = [0,1,2,3,4,5]


for i in range(6):
    plt.text(s = s1[i], x=x1[i], y=y1[i] , fontsize=25,va='center',ha='right',alpha=0.8)

plt.title("Specialized Hardware", fontsize=42, pad=20, color='#189AB4')
plt.axis('off')
plt.gca().invert_yaxis()
gc.collect()
plt.show()

A considerable number of Kaggle users use neither GPUs nor TPUs: about half of the respondents (13,234) selected None.

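Takeaways

This was an exercise in visualizing survey results rather than studying data for a competition.

I could see the many ways the author worked to make the visualizations look good.

Because the survey is about Kaggle users themselves, I found the results all the more interesting.

It seems like a good dataset for some light study.

Competition source: https://www.kaggle.com/c/kaggle-survey-2021

Code source: https://www.kaggle.com/vivek468/what-s-up-kaggle-kaggle-survey-2021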

nan	Email newsletters (Data Elixir, O'Reilly Data & AI, etc)	nan	Kaggle (notebooks, forums, etc)	nan	YouTube (Kaggle YouTube, Cloud AI Adventures, etc)	Podcasts (Chai Time Data Science, O’Reilly Data Show, etc)	Blogs (Towards Data Science, Analytics Vidhya, etc)	Journal Publications (peer-reviewed journals, conference proceedings, etc)	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
2	784	50-54	Man	Indonesia	Master’s degree	Program/Project Manager	20+ years	nan	nan	SQL	C	C++	Java	nan	nan	nan	nan	nan	nan	nan	Python	nan	nan	nan	nan	nan	nan	Notepad++	nan	nan	nan	Jupyter Notebook	nan	nan	Kaggle Notebooks	Colab Notebooks	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	A cloud computing platform (AWS, Azure, GCP, hosted notebooks, etc)	nan	nan	nan	nan	None	nan	Never	Matplotlib	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Under 1 year	Scikit-learn	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Linear or Logistic Regression	Decision Trees or Random Forests	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Manufacturing/Fabrication	1000-9,999 employees	1-2	We are exploring ML methods (and may one day put a model into production)	nan	Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data	nan	nan	nan	nan	nan	nan	60,000-69,999	$0 ($USD)	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Kaggle Learn Courses	nan	nan	nan	nan	nan	Cloud-certification programs (direct from AWS, Azure, GCP, or similar)	University Courses (resulting in a university degree)	nan	nan	Advanced statistical software (SPSS, SAS, etc.)	
nan	nan	nan	nan	nan	nan	nan	nan	Journal Publications (peer-reviewed journals, conference proceedings, etc)	nan	nan	nan	nan	nan	Google Cloud Platform (GCP)	nan	Oracle Cloud	nan	nan	nan	nan	nan	nan	nan	nan	nan	Google Cloud Compute Engine	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	None	nan	MySQL	nan	SQLite	Oracle Database	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Google Cloud SQL	nan	nan	nan	nan	nan	nan	nan	Google Data Studio	nan	nan	nan	nan	Qlik	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Automated model selection (e.g. auto-sklearn, xcessiv)	nan	nan	nan	nan	nan	Google Cloud AutoML	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	None	nan
3	924	22-24	Man	Pakistan	Master’s degree	Software Engineer	1-3 years	Python	nan	nan	nan	C++	Java	nan	nan	nan	nan	nan	nan	nan	Python	nan	nan	nan	nan	PyCharm	nan	nan	nan	nan	nan	Jupyter Notebook	nan	Other	Kaggle Notebooks	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	A laptop	nan	nan	nan	nan	nan	Other	Never	Matplotlib	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	I do not use machine learning methods	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Academics/Education	1000-9,999 employees	0	I do not know	nan	nan	nan	nan	nan	nan	None of these activities are an important part of my role at work	nan	$0-999	$0 ($USD)	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	None	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	DataRobot	nan	nan	nan	nan	nan	nan	MySQL	nan	nan	nan	MongoDB	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	MySQL	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	None	nan	nan	nan	nan	nan	nan	nan	nan	No / None	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	No / None	nan	nan	nan	nan	nan	nan	nan	nan	nan	I do not share my work publicly	nan	nan	nan	nan	DataCamp	nan	nan	nan	nan	nan	nan	nan	nan	Basic statistical software (Microsoft Excel, Google Sheets, etc.)	
nan	nan	nan	Kaggle (notebooks, forums, etc)	nan	YouTube (Kaggle YouTube, Cloud AI Adventures, etc)	nan	nan	nan	nan	nan	nan	Amazon Web Services (AWS)	nan	Google Cloud Platform (GCP)	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Microsoft Azure Virtual Machines	Google Cloud Compute Engine	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Azure Machine Learning Studio	Google Cloud Vertex AI	DataRobot	nan	nan	nan	nan	nan	nan	MySQL	PostgreSQL	nan	nan	MongoDB	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Microsoft Power BI	nan	nan	nan	Tableau	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Automated model selection (e.g. auto-sklearn, xcessiv)	nan	nan	nan	nan	nan	nan	nan	nan	DataRobot AutoML	nan	nan	nan	nan	nan	nan	nan	nan	TensorBoard	nan	nan	nan	nan	nan	nan	nan