Wikipedia-Browser

The notebooks below plot the Greek Wikipedia articles in an interactive 2-D representation. I first built a vector representation of the documents (mainly tf-idf) and then applied dimensionality reduction to bring the vector space down to two dimensions. Some examples also apply clustering analysis. Similar documents end up plotted close to each other, even though I spent little time on feature extraction from the documents.

Due to a change in the plotting library bokeh, all the dynamic plots stopped working. There is a workaround, and I will apply it soon. For now, only the first notebook in the list works.

Notebooks:

GitHub does not render interactive notebooks, so the links below use nbviewer instead.

Text representation

TF-IDF

  • Wikipedia visualization of all the articles: nbviewer link [16.5 MB]

  • Wikipedia visualization of top 100 categories: nbviewer link [24.7 MB]

  • Wikipedia visualization of all articles with their top category: nbviewer link [37.1 MB]
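To make the tf-idf representation behind these notebooks concrete, here is a minimal sketch using only the Python standard library. It is not the code from the notebooks: the function names `tfidf_vectors` and `cosine`, the whitespace tokenizer, the smoothed idf formula, and the three toy English sentences are all assumptions made for illustration. It covers only the vectorization step, not the reduction to two dimensions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption): terms found in every document
        # still keep a small positive weight.
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine(a, b):
    # Both vectors have unit length, so the dot product is the cosine.
    return sum(w * b.get(term, 0.0) for term, w in a.items())


# Hypothetical toy corpus standing in for Wikipedia articles.
docs = [
    "athens is the capital of greece",
    "thessaloniki is a city in greece",
    "python is a programming language",
]
vecs = tfidf_vectors(docs)
sim_greek = cosine(vecs[0], vecs[1])  # two documents about Greek cities
sim_mixed = cosine(vecs[0], vecs[2])  # unrelated topics
print(sim_greek > sim_mixed)
```

The two documents about Greek cities share the rarer term "greece", so their cosine similarity is higher than that of the unrelated pair. This is the property that lets related articles land near each other once the vectors are projected to 2-D.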

Clustering

K-means

  • Clustering on the above tf-idf using k-means with k = 8 clusters: nbviewer link [75.4 MB]

  • Clustering a tf-idf matrix reduced to 2 dimensions, with k = 8 clusters (experimental): nbviewer link [37.7 MB]
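As a rough illustration of what k-means does on a 2-D matrix like the one in the experimental notebook, the sketch below implements Lloyd's algorithm with the standard library only. It is an assumption-laden toy, not the notebooks' code: the `kmeans` function, the seeded random initialization, and the six sample points are hypothetical, and it uses k = 2 rather than the k = 8 of the notebooks so the result is easy to check by eye.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct data points (an assumption;
    # libraries usually offer smarter schemes such as k-means++).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids


# Hypothetical 2-D coordinates: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
km_labels, km_centroids = kmeans(points, k=2)
print(km_labels)
```

The first three points receive one label and the last three the other. Which group is numbered 0 depends on the random initialization.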

LSI Topic Modeling

  • Clustering the dataset using topic modeling, with K = 12: nbviewer link [37.3 MB]

DBSCAN (on low-dimensional matrix)

  • Clustering a matrix reduced to 2 dimensions, using DBSCAN with no specific parameters set. Clusters generated: 155. nbviewer link [74.8 MB]
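Unlike k-means, DBSCAN does not take the number of clusters as input. It grows clusters from dense regions and marks isolated points as noise, which is why the cluster count (155 above) is an outcome rather than a setting. The following standard-library sketch shows the idea on 2-D points. The `dbscan` function, the `eps` and `min_pts` values, and the sample points are assumptions chosen for illustration and are not taken from the notebook.

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""

    def neighbours(i):
        # All points within distance eps of point i (including i itself).
        xi, yi = points[i]
        return [
            j
            for j, (xj, yj) in enumerate(points)
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps * eps
        ]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Hypothetical 2-D coordinates: two dense squares and one isolated point.
coords = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (50, 50),
]
db_labels = dbscan(coords, eps=1.5, min_pts=3)
print(db_labels)  # → [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Here the two squares become clusters 0 and 1, and the distant point is labelled -1 (noise). With a small `eps` on a large, unevenly dense 2-D embedding, many small dense pockets each become their own cluster.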