Abstract
I'm interested in social networks and decided to study a case based on a Polaroid fan community called polanoid.net.
It is a community built around Polaroid scans uploaded by its users.
Users can post Polaroids, comment on and favorite Polaroids posted by other users, and create friendship links with other users.
The aim was to create a map based on all friendship links.
I collected the data set on Monday, October 15th 2007, between 12:00 PM and 1:00 PM.
The Method I used
I divided this work into two steps:
- data mining/collection
- visualization
1/ Data mining
Like any site visitor, I didn't have access to the database, so I decided to collect the data over HTTP with a PHP script that I wrote.
The data was collected in fragments because of the PHP script timeout configured by my hosting service, which I cannot change.
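A fragmented collection of this kind can be sketched as a resumable batch loop. The sketch below is in Python rather than the PHP of the original script, and every name in it (`collect_batch`, `collect_all`, `fetch_user`, the time budget) is an assumption made for illustration, not the original code. Each batch stands for one script run: it processes users until its time budget is spent, then reports where the next run should resume.

```python
import time

def collect_batch(user_ids, start, fetch_user, budget_seconds, clock=time.monotonic):
    """Process user_ids[start:] until the time budget runs out.

    Returns (next_start, results); next_start == len(user_ids) means done.
    fetch_user is a hypothetical callable standing in for the HTTP request.
    """
    deadline = clock() + budget_seconds
    results = []
    i = start
    # Always handle at least one user per batch so a tiny budget cannot stall.
    while i < len(user_ids) and (i == start or clock() < deadline):
        results.append(fetch_user(user_ids[i]))
        i += 1
    return i, results


def collect_all(user_ids, fetch_user, budget_seconds):
    """Chain batches until every user has been visited."""
    position, collected = 0, []
    while position < len(user_ids):
        position, batch = collect_batch(user_ids, position, fetch_user, budget_seconds)
        collected.extend(batch)
    return collected
```

In a real setup the resume position would be saved between runs (for example in the database) instead of being kept in a local variable.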
I decided to collect the following data for further analysis:
- uid (a unique identifier, created when a visitor signs up)
- name (the name of the user)
- sex (male, female, or unsure when the user doesn't choose male or female)
- date of registration (the day the user registered their account)
- list of friends (list of all the user's friends)
I used a MySQL database to store all the collected data because the data mining is heavy in this case.
Indeed, as in every collection case, we need to filter a huge volume of text.
For instance, to find all the friends of a user:
* The script requested an HTML page
* The script read ALL the lines
* The script recognized specific patterns, using regular expressions, to find the information I needed.
* The script wrote all the collected data to the database.
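The steps above can be sketched as follows. This is a minimal illustration, not the original script: it is written in Python instead of PHP, SQLite (from the standard library) stands in for MySQL, and the HTML markup, the `/user/<uid>` link pattern and the `friendship` table are all hypothetical, since the real page structure of polanoid.net is not described here. The page is passed in as a string, so no HTTP request is made.

```python
import re
import sqlite3

# Hypothetical markup: the real page structure is not documented here.
FRIEND_LINK = re.compile(r'<a href="/user/(\d+)"[^>]*>([^<]+)</a>')

def extract_friends(html):
    """Scan the page line by line and return the (uid, name) pairs the pattern finds."""
    friends = []
    for line in html.splitlines():
        for uid, name in FRIEND_LINK.findall(line):
            friends.append((int(uid), name.strip()))
    return friends

def store_friends(conn, user_uid, friends):
    """Write one directed friendship row per friend found on the user's page."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS friendship ("
        "source INTEGER, target INTEGER, PRIMARY KEY (source, target))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO friendship (source, target) VALUES (?, ?)",
        [(user_uid, uid) for uid, _ in friends],
    )
    conn.commit()

sample_page = """<html><body>
<h1>Friends of user 7</h1>
<a href="/user/12" class="friend">alice</a>
<a href="/user/31" class="friend">bob</a>
</body></html>"""

conn = sqlite3.connect(":memory:")
friends = extract_friends(sample_page)
store_friends(conn, 7, friends)
rows = conn.execute("SELECT source, target FROM friendship ORDER BY target").fetchall()
print(friends)
print(rows)
```

Storing each link as a directed (source, target) row keeps unilateral and reciprocal friendships distinguishable later on.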
These processes were very time-consuming because of the method and the very large size of the data set (8014 users, about 9500 friendship links).
2/ Visualization
I created another PHP script to extract all the data stored in the database and make it readable by the powerful yEd, the visualization tool I chose.
I decided to export all the data in the GML file format for further studies.
I filtered my export with some criteria:
- I dropped all nodes that weren't involved in any relationship link (= all users who don't have any friends and who weren't considered a friend by anyone)
- I considered a reciprocal friendship relation stronger than a unilateral one (= strong links are drawn thicker than the others)
- each node has a color: pink for female, blue for male and grey for "unsure", which represents people who didn't complete this field in the registration form.
After this filtering script, I produced one GML file containing 1526 nodes and 6718 relationship links (a mix of reciprocal and unilateral links).
First note: only 1526 users out of a total of 8014 are involved in a friend relationship.
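The export and its filters can be sketched as below. This is an illustration in Python, not the original PHP script. The function name `to_gml`, the hex colors and the edge widths are assumptions, and so is the choice to keep one edge per direction and thicken both edges of a reciprocal pair. The `graphics` attributes `fill` and `width` follow the GML conventions that yEd is assumed to read.

```python
# Assumed color values for pink, blue and grey.
COLORS = {"female": "#FFC0CB", "male": "#0000FF", "unsure": "#808080"}

def to_gml(users, links):
    """users: {uid: (name, sex)}; links: set of directed (source, target) pairs."""
    # Keep only users that appear in at least one link, as source or as target.
    connected = {uid for link in links for uid in link}
    lines = ["graph [", "  directed 1"]
    for uid in sorted(connected):
        name, sex = users[uid]
        lines.append(
            f'  node [ id {uid} label "{name}" graphics [ fill "{COLORS[sex]}" ] ]'
        )
    for source, target in sorted(links):
        # A link is reciprocal when the reverse direction was also declared.
        width = 3 if (target, source) in links else 1
        lines.append(
            f"  edge [ source {source} target {target} graphics [ width {width} ] ]"
        )
    lines.append("]")
    return "\n".join(lines)

users = {
    1: ("alice", "female"),
    2: ("bob", "male"),
    3: ("carol", "unsure"),
    4: ("dave", "male"),  # no links at all, so dropped from the export
}
links = {(1, 2), (2, 1), (3, 1)}
gml = to_gml(users, links)
print(gml)
```

User names containing a double quote would need escaping before being written as a GML label; the sketch leaves that out.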
This file was parsed by yEd to produce many visualizations, using several layouts that I customized myself.
I decided not to analyze this community further and to stay in the visualization/graphic creation/art domain, but some obvious observations emerged. I mean, I didn't analyze this data set with a tool like Excel to produce real statistical results; I just stayed in the visualization domain.
A/ centrality analysis
I made a centrality analysis which treats as the center the people with the largest number of friends, or the people most often considered a friend.
I made 2 views:
- an ordered circle (the first registered user at the 3 o'clock position, and the others following counterclockwise, ordered by registration date)
- a hierarchical view
For each of these views, I considered as the center:
- People with the largest number of friends (i.e. "the big friendmaker")
- People most often considered a friend by the others (i.e. "the most loved")
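These two notions of center correspond to out-degree and in-degree on the directed friendship graph. The sketch below counts both in Python. It is only an illustration of the idea: the function name and the sample links are made up, and it does not reproduce how yEd computes centrality in its layouts.

```python
from collections import Counter

def degree_centrality(links):
    """links: directed (source, target) pairs, where source declares target a friend."""
    out_degree = Counter()  # friends declared by each user ("the big friendmaker")
    in_degree = Counter()   # times each user is declared a friend ("the most loved")
    for source, target in links:
        out_degree[source] += 1
        in_degree[target] += 1
    return out_degree, in_degree

links = [(1, 2), (1, 3), (1, 4), (2, 3), (4, 3)]
out_degree, in_degree = degree_centrality(links)
print(out_degree.most_common(1))  # the user who declared the most friends
print(in_degree.most_common(1))   # the user most often declared a friend
```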
Figure: Centrality hierarchical view (zoomed), centered on the people most considered as a friend
Figure: Centrality hierarchical view (zoomed), centered on the people with the biggest number of friends
Figure: Centrality circular view, centered on the people most considered as a friend
Figure: Centrality circular view, centered on the people with the biggest number of friends
B/ Other visualizations with an "ordered circle" layout
All nodes are placed on the same kind of ordered circle as in the previous views.
These views show that the people with the most relationship links tend to be the earliest registered, but there are exceptions.
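The ordered circle placement can be sketched as below: the first registered user sits at the 3 o'clock position and the others follow counterclockwise. This is a Python illustration that assumes evenly spaced nodes. The function name and the sample uids are made up, and it is not yEd's own layout algorithm.

```python
import math

def ordered_circle(uids_by_registration, radius=1.0):
    """Return {uid: (x, y)} with the first uid at 3 o'clock, then counterclockwise.

    Uses a mathematical y-axis pointing up; with a screen y-axis pointing down,
    the sign of y must be flipped to keep the counterclockwise order.
    """
    n = len(uids_by_registration)
    positions = {}
    for index, uid in enumerate(uids_by_registration):
        angle = 2 * math.pi * index / n
        positions[uid] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions

# Four users, already sorted by registration date (oldest first).
positions = ordered_circle([101, 102, 103, 104])
for uid, (x, y) in positions.items():
    print(uid, round(x, 3), round(y, 3))
```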
I produced 2 edge-colored versions too.
Figure: Without edge coloring
Figure: Edges colored by the source of the relationship link (i.e. colored as the friendship maker)
Figure: Edges colored by the target of the relationship link (i.e. colored as the friend of the friendship maker)
The future
A very nice prospect would be the creation of a real-time interface to navigate the cloud of users.
We could explore relationship links as they are created in real time (around 60 relationships are created each day).
This kind of interface would require direct access to the database, and a huge number of requests to it too...