📚 OpenRCE is preserved as a read-only archive. Launched at RECon Montreal in 2005. Registration and posting are disabled.










Created: Monday, January 22 2007 15:27.44 CST Modified: Monday, January 22 2007 15:27.44 CST
This is an imported entry.
Data-mining Wikipedia II
Author: sp

Here are some more details about the program I used to create some graphs from Wikipedia two days ago. The C# source code is now available. The program takes five command-line parameters.
  • The name of the Wikipedia XML input file. You can download it here. It's the 1.8 GB pages-articles file.
  • A number that specifies how many nodes you want in the final graph; 25 to 50 are reasonable values.
  • The keyword to search for (like Egypt for example). Note that this parameter is case-sensitive.
  • A number that specifies how many sentences are considered when searching articles for the keyword.
  • The name of the output file. This file is a graph definition file that can be turned into a graph using dot.exe from the GraphViz package (like dot.exe -Tsvg output.txt > graph.svg for example).
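In outline, the search-and-graph step can be sketched like this. This is a minimal Python re-imagining, not the original C# source: the naive sentence splitting, the [[link]] extraction regex, and the way the node cap is applied are all simplified assumptions made for illustration.

```python
import re

def first_sentences(text, n):
    """Return the first n sentences of text (n == 0 means the whole text)."""
    if n == 0:
        return text
    return " ".join(re.split(r"(?<=[.!?])\s+", text)[:n])

def build_dot(articles, keyword, n_sentences, max_nodes):
    """articles: iterable of (title, wikitext) pairs. Emit a DOT digraph
    connecting each matching article to the [[targets]] it links to."""
    lines = ["digraph wiki {"]
    matched = 0
    for title, text in articles:
        if matched >= max_nodes:
            break
        # Case-sensitive substring search, as in the original program.
        if keyword in first_sentences(text, n_sentences):
            matched += 1
            for target in re.findall(r"\[\[([^\]|#]+)", text):
                lines.append('  "%s" -> "%s";' % (title, target.strip()))
    lines.append("}")
    return "\n".join(lines)

articles = [
    ("Cairo", "Cairo is the capital of [[Egypt]]. It lies on the [[Nile]]."),
    ("Berlin", "Berlin is the capital of [[Germany]]."),
]
print(build_dot(articles, "Egypt", 3, 25))
```

The resulting text can then be fed to dot.exe -Tsvg as described above.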
I toyed around with the parameters for a while and here are a few more things I noticed:
  • Searching through the first three sentences of all articles seems to produce very nice results for most keywords.
  • If the keyword is relatively rare (for example "Australian rules football") it's OK to search through the entire article (set the sentences parameter to 0). Don't do this for popular keywords though, or you'll end up with a graph that shows articles that are objectively important but only tangentially related to the keyword. If you do a full-article search for "Germany", for example, you end up with a graph full of nodes containing the names of other European countries that played a role in Germany's history. That's because articles about countries have a high importance, and all European countries were somehow important to Germany in the last few thousand years.
  • Trying to use the size of articles as an indicator of their relative importance didn't work out. Look at this 4,000-word treatise on the Goomba and compare it to the page for Niels Bohr, which is only half as long. This should give you a first idea of the potential problems. There's still legacy code in the app from where I tried that idea. It's easy to enable it again, but you need to recompile the app.
  • Trying to use the position of the keyword relative to the size of the article didn't work either; long articles still carry too much weight. A better idea might be to give an article weight 10 if the keyword appears in the first 10% of the article, weight 9 if it appears in the next 10%, and so on. I didn't try that though.
  • Different parameters lead to different meanings in the resulting graphs. If you do a 3-sentence search for LSD, the graph shows information about the drug itself and its history. If you do a full-text search, one half of the graph is dominated by rock stars.
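The decile-weighting idea mentioned above (which the author notes was never tried) is simple to state precisely. A hypothetical sketch, where first_offset is the character position of the keyword's first occurrence:

```python
def position_weight(first_offset, article_length):
    """Decile weighting: 10 if the keyword first appears in the first 10%
    of the article, 9 in the next 10%, and so on down to 1 in the last 10%."""
    decile = (10 * first_offset) // article_length  # 0 .. 9
    return 10 - decile

# A hit near the start of a 1000-character article outweighs one near the end:
print(position_weight(50, 1000), position_weight(950, 1000))  # 10 1
```

Unlike a raw position-vs-size ratio, this caps the spread between long and short articles at a fixed 10:1.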
Here are a few other graphs which turned out particularly well:

The key to cool graphs is to choose a keyword that has lots of articles which nevertheless belong closely together. An example of a bad keyword is "Mathematics". There are thousands of math-related articles in Wikipedia, but they don't belong closely together because math is a huge and fragmented field. The resulting graphs for keywords like this degenerate into trees or unconnected subgraphs.

Generating a graph takes approximately 5 minutes on my computer. In most cases nearly all the time is spent parsing the 8 GB XML file; generating the actual graph is nearly always a matter of seconds. Only for keywords like Germany or America, which have some ten thousand relevant articles, does generating the graph take a few more minutes.



If you wish to comment on this blog entry, please do so on the original site it was imported from.
