
[Learning] Playing with word2vec


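A short, relaxed session playing with word2vec (natural language processing).

Setting up the environment takes a tremendously long time. Just as people learn at different speeds, machines undeniably differ in performance, so the best plan is to kick off make and go to sleep.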


===

■ Building the environment

yukio@dynabook ~/word2vec/word2vec-read-only
$ ./demo-phrases.sh
make: Nothing to be done for ‘all’.
Starting training using file news.2012.en.shuffled-norm0
Words processed: 296900K     Vocab size: 33198K
Vocab size (unigrams + bigrams): 18838711
Words in train file: 296901342
Words written: 296900K
real    36m31.377s
user    21m55.781s
sys     7m5.359s
Starting training using file news.2012.en.shuffled-norm0-phrase0
Words processed: 280500K     Vocab size: 38761K
Vocab size (unigrams + bigrams): 21728781
Words in train file: 280513979
Words written: 280500K
real    29m38.999s
user    19m26.875s
sys     7m46.811s
Starting training using file news.2012.en.shuffled-norm1-phrase1
Vocab size: 681320
Words in train file: 283545447
Alpha: 0.000005  Progress: 100.00%  Words/thread/sec: 84.98k
real    111m20.243s
user    838m48.234s
sys     0m53.780s

That took quite a while.
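
To put numbers on that, the real, user, and sys values in the log can be added up. The sketch below is only an illustration: the stage names are labels taken from the input file names, and to_seconds is a helper written for this example. It totals the wall-clock time and compares CPU time with wall-clock time for each stage.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '111m20.243s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# (real, user, sys) per stage, copied from the log above.
# The keys are just labels derived from the input file names.
stages = {
    "norm0": ("36m31.377s", "21m55.781s", "7m5.359s"),
    "norm0-phrase0": ("29m38.999s", "19m26.875s", "7m46.811s"),
    "norm1-phrase1": ("111m20.243s", "838m48.234s", "0m53.780s"),
}

total_real = 0.0
cpu_ratio = {}
for name, (real, user, sys_time) in stages.items():
    real_s = to_seconds(real)
    total_real += real_s
    # A ratio well above 1 means several CPU cores were busy at once.
    cpu_ratio[name] = (to_seconds(user) + to_seconds(sys_time)) / real_s
    print(f"{name}: wall {real_s / 60:.1f} min, CPU/wall {cpu_ratio[name]:.2f}")

print(f"total wall time: {total_real / 3600:.2f} h")
```

With the numbers from the log, the total wall-clock time comes to a little under three hours, and the last stage uses about 7.5 times as much CPU time as wall-clock time, which suggests the training ran on several threads.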

■ What does ‘computer’ return?

Enter word or sentence (EXIT to break): computer

Word: computer  Position in vocabulary: 1922

Word       Cosine distance
————————————————————————
computers              0.816240
software              0.708589
laptop              0.694821
computer’s              0.676898
keystrokes              0.653019
electronic              0.631960
device              0.622195
computers’              0.618579
mobile_device              0.609824
flash_drive              0.606466
desktop_computer              0.601706
internet_connection              0.599446
thumb_drive              0.595328
keystroke              0.593948
computerized              0.590586
desktop              0.587194
web              0.586146
malware              0.585932
word_processing              0.585793
debug              0.584467
wi_fi_hotspot              0.584011
error_messages              0.583115
laptop_computer              0.582110
arduino              0.580863
user              0.580396
your_computer’s              0.580019
spyware              0.578177
server              0.577982
handheld_devices              0.577063
automated              0.576914
devices              0.576548
tracking_software              0.576425
web_servers              0.576260
computer’s_hard_drive              0.576223
mobile_phone              0.575662
usb_drive              0.574477
encryption              0.573236
malicious_code              0.572133
remote_server              0.569906
desktop_pc              0.569610

Interestingly, arduino scores closer to computer than server does (0.580863 versus 0.577982). That probably depends in part on when the corpus was built.
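
Although the column is labelled "Cosine distance", the values behave like a cosine similarity: a higher number means a closer word, which is why the list is sorted in descending order. A minimal sketch of that ranking follows. The three-dimensional vectors are invented for illustration and have nothing to do with the trained model, and nearest is a hypothetical helper, not part of word2vec.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, vectors, top_n=3):
    """Rank every other word by similarity to `query`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors, made up for this example only.
toy = {
    "computer": [0.9, 0.1, 0.2],
    "laptop": [0.8, 0.2, 0.3],
    "server": [0.6, 0.6, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

for word, score in nearest("computer", toy):
    print(f"{word:10s} {score:.6f}")
```

In a real model the vectors have hundreds of dimensions, so small differences such as the one between arduino and server should not be over-interpreted.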

■ And with ‘japan’...

Enter word or sentence (EXIT to break): japan

Word: japan  Position in vocabulary: 1035

Word       Cosine distance
————————————————————————
japan’s              0.783928
south_korea              0.781404
china              0.733827
japanese              0.723014
tokyo              0.704671
asia              0.644378
europe              0.639472
other_asian_nations              0.620738
taiwan              0.619987
germany              0.615176
kuril_islands              0.613550
tokyo_march_upi              0.613329
india              0.612798
korea              0.609041
countries              0.592254
asia_excluding              0.583868
united_states              0.583465
thailand              0.582957
tokyo_april_upi              0.580327
brazil              0.576181
china’s              0.574224
ap_tokyo              0.573809
senkaku_diaoyu_islands              0.573809
territorial_dispute_between              0.573254
over_disputed_islands              0.572992
tokyo_sept_upi              0.570736
territorial_row_between              0.567809
tokyo_nov_upi              0.567369
philippines              0.563850
last_year’s_fukushima_nuclear              0.563020
ryukyu_islands              0.562682
asian_countries              0.562654
japan_south_korea              0.558992
south_korean              0.558505
australia              0.557522
russia              0.556449
chinese              0.556147
tokyo_dec_upi              0.556115
territorial_row              0.553880
beijing              0.552677

These results make it very clear that the output depends on when the corpus was built.

■ For ‘akihabara’, a district that is well known abroad too, the closest words are sightseeing spots.

Enter word or sentence (EXIT to break): akihabara

Word: akihabara  Position in vocabulary: 300750

Word       Cosine distance
————————————————————————
ginza              0.630624
shibuya              0.627033
asakusa              0.582663
shinjuku              0.578208
harajuku              0.562503
omotesando              0.562430
shibuya_district              0.545689
aoyama              0.535024
roppongi_district              0.534838
roppongi              0.530840
yoyogi              0.530087
shopping_arcades              0.528585
wako              0.516259
co_jp              0.513500
ginza_district              0.508774
okayama              0.507944
tokyo’s_ginza              0.506670
tokyo              0.505515
tokyo’s              0.498354
yoshinori              0.496685
osaka              0.493471
otaku              0.493339
hiroko_tabuchi_contributed_reporting              0.493252
buynow              0.490066
shopping_district              0.489351
nihon              0.488485
zeniya              0.485182
shimbashi              0.482262
zhongguancun              0.480785
roppongi_hills              0.479378
tetsuo              0.474016
yoyogi_park              0.471967
minami              0.471964
azabu              0.470780
osaka’s              0.469058
laforet              0.468352
yanagi              0.466992
electronics_store              0.465139
electronics              0.464955
nikkei_index_shed              0.463926

■ Next, ‘service’.

Enter word or sentence (EXIT to break): service

Word: service  Position in vocabulary: 495

Word       Cosine distance
————————————————————————
services              0.762807
service_providers              0.570149
service_provider              0.557932
customers              0.547668
network              0.544316
access              0.541908
provider              0.532947
operators              0.526344
customer_service              0.524407
providers              0.523774
service’s              0.508977
facilities              0.508721
lebara              0.508005
employees              0.503374
monthly_subscription              0.502840
functions              0.497853
mobile              0.497117
providing              0.494426
stations              0.493734
helotrac_x              0.492092
delivery              0.492068
maintenance              0.486617
internet_access              0.486330
voip              0.486298
postal_services              0.485677
gametanium              0.484630
online_portal              0.484593
systems              0.482075
broadband_access              0.480736
inaer              0.480549
staff              0.478217
exent’s              0.477309
high_bandwidth              0.476419
subscription_based              0.472000
system              0.471115
call_centers              0.470993
enabling              0.470146
streamwide              0.468853
customer              0.468721
users              0.467595

■ Even with a 2012 corpus, might it be possible to compute the proximity of a combination of words as a phrase?

How about ‘service science’? This will be the last query.

Enter word or sentence (EXIT to break): service science

Word: service  Position in vocabulary: 495

Word: science  Position in vocabulary: 1655

Word       Cosine distance
————————————————————————
services              0.660516
scientific_research              0.626619
scientific              0.623595
technology              0.622529
educational              0.607284
research              0.597329
technologies              0.554766
engineering              0.554162
resource              0.544080
innovation              0.543709
systems              0.542555
programs              0.537852
fully_accredited              0.535299
expertise              0.534186
education              0.532555
science_engineering              0.532547
communication              0.527393
lifelong_learning              0.521500
software_engineering              0.520856
program              0.519150
functions              0.516761
teaching              0.513441
computing              0.512357
applications              0.512339
enterprise              0.511462
communications              0.507896
physical_sciences              0.506098
innovative_technology              0.506033
scientific_discoveries              0.501350
biomedical              0.501021
technology_directorate              0.498348
literacy              0.497534
curriculum              0.496828
solutions              0.496724
software_development              0.495778
information_technology              0.495600
math_science              0.494913
academic_research              0.494798
cutting_edge_research              0.494436
collaborative              0.494172
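
Note that the output above lists "Word: service" and "Word: science" separately, so the query was treated as two words and not as a single phrase token (which would be written with an underscore, such as service_science, if the corpus contained it). As far as I understand the distance tool, it adds the vectors of the input words, normalizes the sum, and ranks the rest of the vocabulary against that combined vector. A minimal sketch under that assumption, with invented toy vectors and a hypothetical helper name:

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def query_neighbors(words, vectors, top_n=3):
    """Rank the vocabulary against the normalized sum of the query vectors."""
    # Every query word must be in `vectors`; unknown words raise KeyError.
    unit = [normalize(vectors[w]) for w in words]
    combined = normalize([sum(column) for column in zip(*unit)])
    scores = [
        (word, sum(a * b for a, b in zip(combined, normalize(vec))))
        for word, vec in vectors.items()
        if word not in words  # the query words themselves are skipped
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors, made up for this example only.
toy = {
    "service": [1.0, 0.0, 0.0],
    "science": [0.0, 1.0, 0.0],
    "technology": [0.6, 0.7, 0.1],
    "services": [0.9, 0.1, 0.0],
    "banana": [0.0, 0.0, 1.0],
}

for word, score in query_neighbors(["service", "science"], toy):
    print(f"{word:12s} {score:.6f}")
```

This would explain why the real result mixes neighbors of both words: technology and research sit between the two, while services is close to only one of them.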

That's all.
