Process Name: JMDict_bipartite_analysis.exe
User Name: Jeff
CPU: 99%
Mem Usage: 163,572K (and rising)
My computer has slowed to a crawl as I’ve been testing different ways to improve the data set for my project.
I expect you probably want to know exactly what it is I’m doing, so I’ll explain that first.
I’m making a graph of every word in several languages, and the connections between them. I then take this graph and feed it into an algorithm which gives me communities, or clusters, within this graph: words and phrases that are related to “seed” words or phrases that I enter as input. At the moment, the graph includes Japanese, English, German, and some French and Russian. I only have a dictionary file for links between Japanese and the other four. If someone can find me public-domain dictionary files between any of the non-English languages, that would be wonderful.
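To give a rough idea of the "seed word in, related words out" step, here is a minimal Python sketch. It is not the community-detection algorithm the project uses; it is a much simpler stand-in that collects every word within a fixed number of hops of the seeds. The names `build_graph`, `neighborhood`, and `max_hops`, and the handful of dictionary pairs, are made up for illustration.

```python
from collections import defaultdict, deque

def build_graph(pairs):
    """Build an undirected graph (word -> set of linked words) from dictionary pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def neighborhood(graph, seeds, max_hops):
    """Breadth-first search: return {word: hop distance} for words within max_hops of a seed."""
    seen = {s: 0 for s in seeds if s in graph}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for nbr in graph[node]:
            if nbr not in seen:
                seen[nbr] = seen[node] + 1
                queue.append(nbr)
    return seen

# Hypothetical Japanese-to-other-language entries, for illustration only.
pairs = [
    ("犬", "dog"), ("犬", "Hund"),
    ("猟犬", "hound"), ("猟犬", "dog"),
    ("猫", "cat"),
]
graph = build_graph(pairs)
related = neighborhood(graph, ["dog"], max_hops=2)
print(sorted(related.items(), key=lambda item: item[1]))
```

With "dog" as the seed, the two Japanese entries linked to it are one hop away, and their German and English translations are two hops away; "cat" is never reached because nothing links it to the seed.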
At present, the graph contains about 297 thousand nodes (entries), up from 262 thousand nodes before I started expanding the data set yesterday. I also just finished expanding the number of edges (connections) in the graph from around 500 thousand to more than 1.2 million. Right now I’m debugging a procedure to automatically find and remove noisy entries, ones like “to” and “and” which have tens of thousands of links to other words but no real value for this type of graph.
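One simple way to find noisy entries like these is to drop any node whose number of links exceeds a cutoff. The Python sketch below shows that idea only; it is an assumption about how such a filter could look, not the procedure being debugged, and the names `remove_noisy_nodes` and `max_degree`, the cutoff value, and the sample edges are all hypothetical.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected graph (word -> set of linked words) from edge pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def remove_noisy_nodes(graph, max_degree):
    """Drop every node with more than max_degree links, plus all edges touching it."""
    noisy = {node for node, nbrs in graph.items() if len(nbrs) > max_degree}
    cleaned = {
        node: nbrs - noisy          # remove links that pointed at noisy nodes
        for node, nbrs in graph.items()
        if node not in noisy
    }
    return cleaned, noisy

# Tiny made-up example: "to" links to many particles, so it counts as noise here.
edges = [
    ("to", "に"), ("to", "へ"), ("to", "まで"),
    ("dog", "犬"), ("hound", "犬"),
]
graph = build_graph(edges)
cleaned, noisy = remove_noisy_nodes(graph, max_degree=2)
print("removed:", noisy)
print("remaining nodes:", len(cleaned))
```

On the real graph the cutoff would be far higher than 2, and a fixed threshold may also catch legitimately well-connected words, so the list of removed nodes is worth reviewing by hand.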
http://sourceforge.net/projects/stardict/
free dictionary files