Attributes inference using centrality measure in online social networks



I am attempting the future direction given in the paper "Inferring attributes via social relations". The authors write that if one uses the "closeness" and "betweenness" measures, the accuracy of attribute inference will be improved.


I have implemented their approach for inference and am attaching it to this post.



How can we use the closeness and betweenness measures to get improved results?

The above paper can be downloaded from here.
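One way to read the authors' suggestion (this is my interpretation, not the paper's actual algorithm): when inferring a hidden attribute from a node's neighbours, weight each neighbour's vote by its centrality, so that structurally important neighbours count more. A stdlib-only sketch using closeness centrality, with a toy adjacency dict and labels of my own invention:

```python
from collections import deque

def closeness(adj, s):
    # BFS distances from s; closeness = (n - 1) / sum of distances
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def infer(adj, labels, node):
    # Centrality-weighted vote among the labelled neighbours of `node`
    votes = {}
    for v in adj[node]:
        if labels.get(v) is not None:
            votes[labels[v]] = votes.get(labels[v], 0.0) + closeness(adj, v)
    return max(votes, key=votes.get) if votes else None

# Toy graph: node 0's gender is hidden, its neighbours are labelled
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
labels = {0: None, 1: "M", 2: "F", 3: "F"}
print(infer(adj, labels, 0))  # → F
```

The same scheme works with betweenness in place of closeness; in practice one would use `networkx.closeness_centrality` / `networkx.betweenness_centrality` rather than hand-rolled BFS.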

Before looking into the algorithm, I wanted to reproduce the steps successfully.

So I followed the steps in the README file. [This needed some Python package installation and minor path modifications in your scripts.]

In step (7), I gave 10 as the % to hide.

The output file generated was named 'hide_10.gml'. Step 8 seems incorrect, and is perhaps not needed.
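For reference, I assume step (7) hides a given % of the gender labels at random; the authors' script may differ, but the operation can be sketched like this (the function name and dict representation are my own):

```python
import random

def hide_attribute(genders, pct, seed=0):
    # Randomly mark pct% of the known gender labels as hidden (None).
    # `genders` maps node id -> label; assumption about step (7)'s intent.
    rng = random.Random(seed)
    nodes = list(genders)
    k = int(len(nodes) * pct / 100)
    hidden = set(rng.sample(nodes, k))
    return {n: (None if n in hidden else g) for n, g in genders.items()}

genders = {i: ("M" if i % 2 else "F") for i in range(100)}
masked = hide_attribute(genders, 10)
print(sum(v is None for v in masked.values()))  # → 10
```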

For the next step, I used hide_10.gml, but got the following errors:


  node [
Expected "]" (at char 174), (line:8, col:3)
Traceback (most recent call last):
  File "./", line 5, in <module>
  File "<string>", line 2, in read_gml
  File "/usr/local/lib/python2.7/site-packages/networkx/utils/", line 263, in _open_file
    result = func(*new_args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/networkx/readwrite/", line 85, in read_gml
  File "/usr/local/lib/python2.7/site-packages/networkx/readwrite/", line 136, in parse_gml
    tokens =gml.parseString(data)
  File "/usr/lib/python2.7/site-packages/", line 1032, in parseString
    raise exc
pyparsing.ParseException: Expected "]" (at char 174), (line:8, col:3)

Can you check why this error occurs?
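One common cause of this pyparsing `Expected "]"` error is an unquoted string value inside a `node [` block, e.g. `gender male` instead of `gender "male"`. I can't see your file, so this is a guess; check line 8, column 3 of hide_10.gml first. If that is the problem, a quick sketch that quotes bare values before handing the text to networkx:

```python
import re

def quote_gml_strings(text):
    # Wrap bare word values (not numbers, not brackets) in double quotes.
    # Heuristic: lines of the form `key value` where value starts with a
    # letter and is not already quoted.
    pattern = re.compile(r'^(\s*\w+\s+)(?!["\[\]])([A-Za-z_][\w\-]*)\s*$',
                         re.MULTILINE)
    return pattern.sub(r'\1"\2"', text)

gml = 'node [\n  id 1\n  gender male\n]'
print(quote_gml_strings(gml))  # → gender male becomes gender "male"
```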

We need to make these two modifications in the "" file:

1) Replace the line f=open("final.gml","r") with f=open("hide_10.gml","r")

2) Replace the next line, i.e. e=open("final1.gml","w"), with e=open("hide.gml","w")

What is the difference between Accuracy and the % of users inferred by the above three schemes? Are they the same? If not, how do we calculate the accuracy?

The columns (10%, etc.) refer to the 4 data sets obtained after hiding the respective % of gender info.

The cell entries are accuracies computed using the 3 methods on the corresponding data sets (columns).

The accuracy for $cell_{ij}$ is computed as:

$$Acc(cell_{ij}) = \frac{\text{no. of correctly predicted genders in (data set)}_j\ [\text{using method}_i]}{\text{sizeof}\,(\text{data set})_j \times (\%\text{ value of col } j / 100)}$$

e.g. $$0.6489 = \frac{\text{no. of correctly predicted genders in (data set)}_{10\%\text{ hidden gender}}\ [\text{using Global Method}]}{\text{sizeof}\,(\text{data set})_{10\%\text{ hidden gender}} \times (10/100)}$$
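The formula above translates directly into code (the numbers below are illustrative, not taken from the paper's tables):

```python
def cell_accuracy(correct, dataset_size, hidden_pct):
    # Accuracy for one table cell: correct predictions divided by the
    # number of hidden labels (size * pct / 100).
    hidden = dataset_size * hidden_pct / 100
    return correct / hidden

# Hypothetical inputs: 5000 users, 10% hidden (= 500), 324 predicted correctly
print(cell_accuracy(324, 5000, 10))  # → 0.648
```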


The above post is not very clear. Could you please tell me how to calculate the accuracy for the following case: suppose the data set size is 6600, so 10% of it is 660. We run the global method and predict the gender of 430 users correctly. What is the accuracy?

data size = 6600

task = predict the missing 660; known = $(6600 - 660) = 5940$.

correct predictions = 430.

If you use

$$accuracy = \frac{\text{known} + \text{correct}}{\text{data size}} = \frac{5940+430}{6600}$$

it will hide (by scaling down, due to the large data size) the fluctuations across different data sets.

Taking $correct/hidden$ ignores the data set size unless I also specify that 10% was hidden. While this ratio seems fine for algorithms whose accuracy improves linearly with data set size, it may not be suitable for algorithms that are sensitive to data set size.

So, use $correct/hidden$, but check how the algorithm behaves when it is run on a larger data set (again with 10% hidden).
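For the 6600-user example, the two candidate metrics give very different numbers, which is why $correct/hidden$ is the more informative one:

```python
size, hidden, correct = 6600, 660, 430
known = size - hidden  # 5940

overall = (known + correct) / size  # dominated by the already-known labels
per_hidden = correct / hidden       # measures inference quality only

print(round(overall, 4))     # → 0.9652
print(round(per_hidden, 4))  # → 0.6515
```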

From Wikipedia:

$$\text{accuracy}=(\text{sensitivity})(\text{prevalence}) + (\text{specificity})(1-\text{prevalence})$$

The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. It may be better to avoid the accuracy metric in favor of other metrics such as precision and recall.
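The identity above can be verified on a confusion matrix; the counts below are made up (chosen as exact binary fractions so the equality holds without rounding):

```python
# Hypothetical confusion matrix counts
tp, fn, tn, fp = 40, 24, 48, 16
total = tp + fn + tn + fp  # 128

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
prevalence = (tp + fn) / total # fraction of actual positives

accuracy = (tp + tn) / total
identity = sensitivity * prevalence + specificity * (1 - prevalence)
print(accuracy == identity)  # → True
```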