Keywords

.NET (3) .rb (1) *.cod (1) 3110c (1) Algorithm (1) Amazon Cloud Drive (1) amkette (1) Android (1) Apex (6) apex:dynamic (1) API (1) API version (1) Application Development Contest (2) Artificial Intelligence (2) Atricore (1) b2g (1) Binary Search Tree (1) Blackberry Application Development (1) Blackberry Java Development Environment (1) Blender Game Engine (1) bluetooth (2) Boot2Gecko (1) bug fix (1) C (1) C++ (2) Cloud computing (1) Cloud Storage (1) Code Blocks (1) Code for a Cause (2) codejam (1) Coding (1) const_cast (1) Custom Help (1) Dancing With the Googlers (1) Data Structures (1) desktop environment (5) Doubly Linked List (1) Dropbox (1) dynamic visualforce component (1) dynamic_cast (1) Enterprise WSDL (1) Execution Context (1) fedora 14 (1) fedora 17 (5) Firefox OS (1) Flashing Nokia 3110c handset (1) Force.com (7) Gaia (1) Game Developement (1) GCC (2) GDG (2) Goank (1) Google (4) Google Developer Group (2) Google Drive (1) GTK+ (5) HACK2012 (2) Hall of Mirrors (1) help for this page (1) HTML5 (2) HTTP Web Server (1) IDE (1) Identity Provider (1) install (1) Intelligent Systems (1) Java (1) JDE (1) JOSSO (1) location based social network (1) machine learning (1) me.social (1) MinGW (1) Natural Language Processing (1) Natural Language Toolkit (1) neckphone (1) NLKT (1) Nokia Pheonix (1) Notebook (1) Numeric XML Tags (1) OAuth2.0 (1) OLPC (7) OLPC-XO-1 (7) One Laptop per Child (5) Override custom help (1) Paas (1) Partner WSDL (1) Polymorphism (1) programming contest (1) PyGTK (4) Python (11) Recycled Numbers (1) reinterpret_cast (1) Research (1) REST (1) RM-237 (1) Robotics (1) Ruby (1) Saas (2) Salesforce.com (7) scikit-learn (1) SDK (1) Service Provider (1) Single sign on (1) sklearn (1) SOAP (3) Speaking in Tongues (1) SSO Agent (1) SSO Gateway (1) static_const (1) sugar (7) sugar activity (4) sugarlabs (7) SVG (2) Symbiotic AI (1) Tabbed container (1) TCP/IP (1) TCP/IP stack (1) Typecasting (1) typeid (1) ubuntu 13.10 (1) UDP (1) Upgrade Assembly (1) Visualforce (2) Web Server (1) Web Services (3) Web2.0 (1) wikipedia (1) wikipediaHI (1) WSDL (1) XML tags (1)

Monday, January 7, 2013

WikipediaHI: Offline Wikipedia in Hindi !!





Last week I spent some time working on WikipediaHI activity for Sugar Desktop Environment. I must say it is one of the awesome activities I have come across. The best part is that it can serve you with data in offline mode. That is even if don't have internet connection which is otherwise required to access Wikipedia online, then also your WikipediaHI activity will serve your purpose.

There are lot many developers and contributors who are working in collaborative form on such awesome stuff who continuously inspire you to take up new things and create something that can be used by others in the world. Sugar developers and contributors are epitome of such group.

I came across few of such developers, Anish Mangal and Gonzalo Odiard, two of them whose contributions are significant for Sugar. I took up the task of creating WikipediaHI using Wikipedia dump for Hindi available for free. I followed the steps specified on this page[ hosted by Gonzalo] for creating Wikipedia activity in your own language.

I will quickly explain the steps I took to create WikipediaHI:

1) Downloaded the Wikipedia dump file for Hindi:
http://dumps.wikimedia.org/hiwiki/20121225/hiwiki-20121225-pages-articles.xml.bz2
NOTE: [ Make sure you pick the valid latest file from here : http://dumps.wikimedia.org/hiwiki/   this location will show you listing as per dates. Pick the latest dump and proceed further.]

and downloaded WikipediaBase from this link

2) Created "hi" directory for HINDI under WikipediaBase directory and moved the downloaded dump to this folder.

3) Extracted contents of this file using:
bzip2 -d hiwiki-20121225-pages-articles.xml.bz2

4) Processed the dump using page parser:
../tools2/pages_parser.py

The result of this operation will generate these files:
hiwiki-20121225-pages-articles.xml.links
hiwiki-20121225-pages-articles.xml.page_templates
hiwiki-20121225-pages-articles.redirects
hiwiki-20121225-pages-articles.templates

5) Then you can include selective articles or all articles from this dump to your activity by using this command:
../tools2/make_selection.py
* Make sure you have favorites.txt and blacklist.txt filled with appropriate keywords.

Now if you want to include all articles use this command:
../tools2/make_selection.py --all

6) Then proceed to create the index for these articles:
../tools2/create_index.py

7) In order to test the index created in previous step you can use this command:
../tools2/test_index.py

8) Next step is to expand the templates of articles :
cd ..
./tools2/expandtemplates.py hi

9) Go back to hi directory and re-create the index :
cd hi
mv hiwiki-20121225-pages-articles.xml.processed_expanded hiwiki-20121225-pages-articles.xml.processed
../tools2/create_index.py --delete_all

10) Download the images for the articles you selected:
cd hi
../tools2/download_images.py

if you want to download the images for pages you selected in previous step:
../tools2/download_images.py --all

11) Create files specific to language:
(a)activity/activity.info.lang : activity info file for you language activity
(b)activity/activity-wikipedia-lang.svg : activity icon for your language
(c)activity_lang.py : activity file for your language
(d)static/about_lang.html : about page for wikipedia in your language.
(e)static/index_lang.html : index page for wikipedia in your language. This is the page displayed when activity is launched. So its important for you to know the articles included in the search.db ( generated when index is created) for you to create the index page.


12) Create the XO file for wikipedia in your language:
./setup_new_wiki.py hi/hiwiki-20121225-pages-articles.xml

I went through the search.db file to identify the articles present in it and create the index page accordingly.
This gave me an idea to write some script that can generate index page(part or whole) to be used as home page for activity using search.db[ Stay tuned for next blog on this idea]

Here you go.. you can see WikipediaHI

On launching this, you can see the index page listing the articles you can view offline using WikipediaHI

If you want to play with WikipediaHI, you can download it : WikipediaHI-35.xo

I must thank Gonzalo for his amazing help and guidance in getting this done. I have to mention here that Wikipedia
changed its XML format in their dumps which resulted in error when I was creating the index. I took Gonzalo's help to get it resolved.
Thanks to Anish, who motivated me to pick this up and guided me to complete it.

Thanks guys !! :D

No comments: