Author Topic: JAPANESE Tokenizing  (Read 1988 times)
turutosiya
Newbie
*
Posts: 2


« on: April 05, 2011, 02:57:07 AM »

Hi.

I'm a newbie at RapidMiner.

I'm trying to mine some web pages with "GetPage", "Extract Content", and "Process Documents".
It seems to work well for ENGLISH pages, but for JAPANESE pages the tokenizer doesn't work well.

Is Japanese tokenization not supported?
Logged

Toshiya TSURU <t_tsuru@sunbi.co.jp>
Sebastian Land
Administrator
Hero Member
*****
Posts: 2426


« Reply #1 on: April 21, 2011, 04:33:45 PM »

Hi,
not really, and as I'm not an expert on Japanese, I don't have a clue how we should do this. They don't have whitespace, do they?
How is it determined where a word ends?

Greetings,
  Sebastian
Logged
Neil McGuigan
Jr. Member
**
Posts: 65


« Reply #2 on: April 21, 2011, 07:15:09 PM »

You will probably want to tokenize using the regular expression mode, with a regular expression matching every single character. This should tokenize the document on every character, which I believe is what you want for Japanese and Chinese.
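Outside RapidMiner, the same per-character tokenization can be sketched in a few lines (a minimal illustration, not RapidMiner's own implementation — the sample sentence here is just an example):

```python
import re

text = "日本語のテキスト"  # "Japanese text" (example input)

# \S matches any single non-whitespace character, so findall
# yields one token per character - the regex-mode trick above.
tokens = re.findall(r"\S", text)
print(tokens)  # each character of the input becomes its own token
```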

You should also try the Text Processing > Transformation > Generate n-Grams (Characters) operator.
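The idea behind character n-grams is to slide a window of n characters over the text, so that frequent multi-character words still surface as features even without word boundaries. A quick sketch of what that operator computes (toy code, not the operator itself):

```python
def char_ngrams(text, n=2):
    """Return all overlapping substrings of length n (character n-grams)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("日本語", 2))  # -> ['日本', '本語']
```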
Logged

karlrb
Newbie
*
Posts: 4


« Reply #3 on: April 22, 2011, 06:13:42 AM »

If I can be of any help, I would be happy to look into any specific questions on this subject. My wife is Japanese and I'm in the process of learning Japanese myself - it is amazingly complex.

Karl Bergerson
Seattle WA USA
karl.bergerson@gmail.com
Logged
Sebastian Land
Administrator
Hero Member
*****
Posts: 2426


« Reply #4 on: April 26, 2011, 09:50:27 AM »

Hi Karl,
you are very welcome to contribute if you can come up with a good algorithm for Japanese tokenization!

With kind regards,
  Sebastian
Logged
turutosiya
Newbie
*
Posts: 2


« Reply #5 on: March 28, 2013, 01:08:32 PM »

Hi All.

It's been a really long time since I started this project. At last, I have time to try.

I'm looking for a document describing the API spec for the Tokenizer.
Does anyone know of one?

I'm trying to implement a JapaneseTokenizer which works with a morphological analysis engine such as ChaSen or MeCab.
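For readers unfamiliar with how such engines segment text: MeCab and ChaSen use dictionary-based morphological analysis. A crude toy approximation of one ingredient, greedy longest-match segmentation against a word list, looks like this (the dictionary and sentence below are made up for illustration; real engines use large lexicons plus statistical cost models, not plain greedy matching):

```python
def longest_match_tokenize(text, dictionary):
    """Greedy longest-match segmentation: at each position take the
    longest dictionary entry that matches; fall back to one character."""
    tokens, i = [], 0
    while i < len(text):
        match = text[i]  # fallback: single character
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens

# "私は学生です" = "I am a student"
dictionary = {"私", "は", "学生", "です"}
print(longest_match_tokenize("私は学生です", dictionary))
# -> ['私', 'は', '学生', 'です']
```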
Logged

Toshiya TSURU <t_tsuru@sunbi.co.jp>