[IT] Thoughts on a facebook crawler

Posted by 斉藤之雄 (Yukio Saito), September 9, 2011

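Anything you have written or uploaded to facebook can be pseudo-downloaded from your facebook account (once the download notification arrives by email), and there are other workarounds such as using facebook mail ⇒ https://www.fxfrog.com/?p=3168 . Even so, I am hearing more often than before from people who want to archive the groups they belong to, or to collect information for statistical purposes.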

—

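■Why are people asking for this?

The reason is simple. facebook is implemented so that it does not pass traffic through for programs that issue a conventional, simple http (https) GET. "Simple GET" here covers basic tools such as wget, as well as code that calls a function like io:http and branches on conditions to decide what to fetch.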
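
To make "simple GET" concrete, here is a minimal sketch in Python 3 using only the standard library. The names `simple_get` and `was_redirected` are hypothetical, and treating a changed final URL as a sign of being turned away is an assumption made for illustration, not a documented description of how facebook responds.

```python
import urllib.error
import urllib.request


def simple_get(url, timeout=10):
    """Plain GET: no cookies, no login, default urllib User-Agent.

    Returns (status_code, final_url). Hypothetical helper for illustration.
    """
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # geturl() reports where we ended up after any redirects.
            return response.status, response.geturl()
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status (4xx/5xx).
        return error.code, url


def was_redirected(requested_url, final_url):
    """True when the final URL differs from the one we asked for."""
    return requested_url.rstrip("/") != final_url.rstrip("/")


# Usage (needs network access, so it is left commented out):
# status, final_url = simple_get("https://www.facebook.com/groups/")
# print(status, was_redirected("https://www.facebook.com/groups/", final_url))
```

Comparing the requested and final URLs is only a rough signal; a server can also return a 200 page that simply lacks the content you wanted.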

↓

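■So, has anyone on the net had a similar need?

Googling turned up a Java program from North America called FaceBukkCraw that apparently existed for a while, but nothing remains in its sourceforge repository, so it is doubtful whether it ever produced anything. And, as mentioned above, I came across scattered threads along the lines of "I can't do an http GET against facebook with wget (even with several arguments and pointing it at a cookie directory), help me, Doraemon!"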
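
For readers who have not tried the wget-plus-cookies approach, the following is a rough Python 3 sketch of the same idea: load a Netscape-format `cookies.txt` (the format wget reads with `--load-cookies`) and send those cookies with each request. The helper name `opener_from_cookie_file` and the User-Agent value are assumptions for illustration, and nothing here suggests that facebook will accept such a request.

```python
import http.cookiejar
import urllib.request


def opener_from_cookie_file(cookie_path):
    """Build a urllib opener that sends cookies from a Netscape-format file."""
    jar = http.cookiejar.MozillaCookieJar(cookie_path)
    # Raises http.cookiejar.LoadError if the file is not in Netscape format.
    jar.load()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Hypothetical browser-like User-Agent; the value is only an illustration.
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")]
    return opener, jar


# Usage (needs a real cookies.txt and network access):
# opener, jar = opener_from_cookie_file("cookies.txt")
# with opener.open("https://www.facebook.com/") as response:
#     print(response.status)
```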

↓

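■What about the techniques I usually consider when crawling is difficult?

One technique is to set up an http cache (proxy) and do screen scraping. But as soon as I thought of scraping, I remembered that facebook provides an API, so apps can retrieve user information, and some of them collect more demographic information than they need to launch. Using that as a lead, I searched further...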
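
As a rough idea of what the API route might look like, here is a sketch in Python 3 that builds a Graph API style URL and splits one JSON page into posts and a "next page" link. Everything specific is an assumption made for illustration and does not come from this post: the `/{group-id}/feed` path, the `limit` parameter, the `data` / `paging.next` response shape, and the helper names. What the API actually returns for a group depends on the access token's permissions and on the API version.

```python
import json
import urllib.parse

GRAPH_BASE = "https://graph.facebook.com"


def group_feed_url(group_id, access_token, limit=25):
    """Build a Graph API style feed URL (group ID and token are placeholders)."""
    query = urllib.parse.urlencode({"access_token": access_token, "limit": limit})
    return f"{GRAPH_BASE}/{urllib.parse.quote(str(group_id))}/feed?{query}"


def parse_feed_page(body):
    """Split one JSON page into (posts, next_page_url_or_None).

    Assumes a response shaped like {"data": [...], "paging": {"next": "..."}}.
    """
    payload = json.loads(body)
    posts = payload.get("data", [])
    next_url = payload.get("paging", {}).get("next")
    return posts, next_url


# Usage with a canned response, so no network access or real token is needed:
sample = '{"data": [{"id": "1", "message": "hello"}], "paging": {}}'
posts, next_url = parse_feed_page(sample)
print(len(posts), next_url)
```

An archiving loop would keep requesting `next_url` until it comes back as `None`.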

↓

■I found sample code implemented in python.

http://floatinginspace.za.org/facebook_status/

Sample code to use it would be:

import facebook_status
fb = facebook_status.fbStatus()
fb.change_status('you@example.com', 'your-password', 'messing around with FB and python')
Get the source file here: facebook_status.py
For a similar solution, but in php, see this nexdot.net blog post.
[sourcecode language="python"]
# This class adds python functions to change your Facebook status message - v0.1
# Copyright (C) 2007 Francois du Toit

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

import os.path, sys, cookielib, urllib2, urllib, re

class fbStatus:

    def __init__(self):
        self.COOKIEFILE = 'cookies.lwp'
        self.urlopen = urllib2.urlopen
        self.cj = cookielib.LWPCookieJar()
        self.Request = urllib2.Request

        if self.cj != None:
            if os.path.isfile(self.COOKIEFILE):
                self.cj.load(self.COOKIEFILE)
            opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
            urllib2.install_opener(opener)

    def get_url(self, url, data=None, headers={'User-agent' : 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0.666b1'}, read=0):
        try:
            req = self.Request(url, data, headers)
            handle = self.urlopen(req)
        except IOError, e:
            print 'We failed to open "%s".' % url
            if hasattr(e, 'code'):
                print 'We failed with error code - %s.' % e.code
            elif hasattr(e, 'reason'):
                print "The error object has the following 'reason' attribute :", e.reason
                print "This usually means the server doesn't exist, is down, or we don't have an internet connection."
            sys.exit()
        else:
            print 'Here are the headers of the page :'
            print handle.code, handle.msg
            #print handle.info()
            if read and handle.code == 200:
                return handle.read()
            else:
                return handle.code

    def change_status(self, email, password, status):

        # log in and get cookies, yum yum
        print 'changing Facebook status message'
        url = 'https://login.facebook.com/login.php?m&next=http%3A%2F%2Fm.facebook.com%2Fhome.php'
        data = urllib.urlencode({'email': email, 'pass': password, 'login': 'Login'})
        self.get_url(url, data)
        # get post_form_id from page
        url = 'http://m.facebook.com/home.php'
        html = self.get_url(url, read=1)
        try:
            id = re.search('post_form_id" value="([^"]*)"', html).group(1)
        except AttributeError:
            print 'Could not find post_form_id in the page, maybe wrong user/pass?'
            return 10
        print 'Got post_form_id:', id
        # POST status message
        data = urllib.urlencode({'post_form_id':id, 'status': status, 'update':'Update'})
        code = self.get_url(url, data)
        if code == 200:
            print '>>> status successfully changed to: ', status
        else:
            print '>>> got status code:', code
        print
        self.print_cookies()

    def print_cookies(self):
        if self.cj == None:
            print "We don't have a cookie library available - sorry."
            print "I can't show you any cookies."
        else:
            print 'These are the cookies we have received so far :'
            for index, cookie in enumerate(self.cj):
                print index, ' : ', cookie
            self.cj.save(self.COOKIEFILE)
[/sourcecode]

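However, now that the framework has evolved, this Python script can no longer retrieve the data correctly. (It also targets the mobile site.)

To start with, these are just observations: not yet a facebook crawler, but enough to pick up some hints.

That's all.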
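
One reason a script like this is fragile is that it pulls `post_form_id` out of the page with a regular expression tied to the exact markup. As a sketch only, the Python 3 snippet below collects every hidden form field with the standard-library HTML parser instead. The names `HiddenFieldParser` and `hidden_fields` are hypothetical, the sample markup is made up, and parsing this way does not by itself make the script work against the current site.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden input fields found in the HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Usage with made-up markup (not a real facebook page):
sample = '<form><input type="hidden" name="post_form_id" value="abc123"></form>'
print(hidden_fields(sample).get("post_form_id"))
```

Using `.get()` on the result returns `None` for a missing field, which is easier to handle than the `AttributeError` the regular-expression version relies on.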

