johnyma22
Newbie

Posts: 18
« Reply #15 on: February 02, 2012, 11:57:18 PM »
As requested, I extended my database schema with a field named label whose value is "true" or "false". I now get this in my results, which I think means something is working right: [image] Can anyone please confirm? Thanks
Marius
« Reply #16 on: February 03, 2012, 12:19:47 PM »
At least it does not look wrong... this is the Naive Bayes model created by the Naive Bayes operator. More interesting would be the distribution table of that model (access it via the radio buttons in the results view). But you are probably more interested in the labelled result set. Thus, you have to connect the lab output of Apply Model to the result output. Anyway, the process Ingo posted should work.
If you still don't get valid results, check the following again:
- Did you connect the wordlist output of Process Documents in the training branch to the input of Process Documents in the apply branch?
- Did you double-check that you read the correct data from both Read Database operators?
- If you don't use isGood, don't retrieve it from the database.
- Find out why the label attribute is empty after Process Documents, and try to fix it. Is it already empty directly after the Read Database operators?
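To illustrate why the wordlist connection in the first point matters, here is a minimal sketch in plain Python (not RapidMiner code). The function names and the sample documents are made up for illustration, and the whitespace tokenizer is only a stand-in for the real token pipeline. The point is that the vocabulary is learned from the training documents only and then reused for the apply documents, so both sets end up with the same columns in the same order.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for the real token pipeline)."""
    return text.lower().split()


def build_wordlist(train_docs):
    """Sorted vocabulary learned from the training documents only."""
    return sorted({tok for doc in train_docs for tok in tokenize(doc)})


def vectorize(doc, wordlist):
    """Term counts for one document: one column per word-list entry, in order."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in wordlist]


# Hypothetical sample data, not taken from the thread.
train_docs = ["school holiday dates", "term starts monday"]
apply_docs = ["holiday dates announced"]

wordlist = build_wordlist(train_docs)
train_vectors = [vectorize(d, wordlist) for d in train_docs]
# Reusing the training word list keeps the columns aligned; words that
# were never seen in training ("announced") are simply dropped.
apply_vectors = [vectorize(d, wordlist) for d in apply_docs]
```

If the apply branch built its own word list instead, its columns would generally differ from the ones the model was trained on.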
Best, Marius
Please add [SOLVED] to the topic title when your problem has been solved! (do so by editing the first post in the thread and modifying the title)
johnyma22
Newbie

Posts: 18
« Reply #17 on: February 03, 2012, 12:44:59 PM »
The distribution model looks like this: [image] In answer to your questions: Yes. Yes. Removed isGood. I'm running some more tests now and will reply once they are completed. Thanks
Marius
« Reply #18 on: February 03, 2012, 01:30:48 PM »
My guess is that you changed your SQL statement so that it no longer fetches a "data" attribute with the text, and that your text attributes are now called "Title" and "Description". If so, the Nominal to Text operators have to be adapted so that they operate on the two new attributes instead of "data". If you have only text attributes and the label, you could set "filter type" to all and uncheck "include special attributes". Didn't you get a warning or error in the "Problems" view at the bottom of RapidMiner saying something like "The example set must contain at least one text attribute"?
Best, Marius
johnyma22
Newbie

Posts: 18
« Reply #19 on: February 03, 2012, 02:20:39 PM »
The SQL statements only fetch data. "Include special attributes" is not checked. I didn't get any warnings. View: [image]
The XML is this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.000" expanded="true" name="Process">
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true" height="386" width="835">
      <operator activated="true" class="read_database" compatibility="5.2.000" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="30">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = &quot;true&quot; OR isGood = -2 AND school_list_holiday_sources.label = &quot;false&quot; LIMIT 0,100"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.000" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true" height="480" width="815">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="90"/>
          <portSpacing port="sink_document 1" spacing="90"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.2.000" expanded="true" height="60" name="Read Database" width="90" x="45" y="210">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT data, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != &quot;true&quot; AND isGood = 0 LIMIT 0,100"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.000" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
          <connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.2.000" expanded="true" height="76" name="Set Role" width="90" x="447" y="30">
        <parameter key="name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="naive_bayes" compatibility="5.2.000" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="30">
        <parameter key="laplace_correction" value="false"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="120">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
      <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
      <connect from_op="Apply Model (2)" from_port="model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="90"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Marius
« Reply #20 on: February 03, 2012, 02:40:35 PM »
You are using Extract Content, which is for HTML documents only. Does your database contain HTML documents? What happens if you remove those operators?
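As a rough illustration of the concern with input that is not HTML, here is a minimal sketch in plain Python (not RapidMiner code, and not a description of how Extract Content works internally). It strips markup only from values that look like HTML and passes everything else through unchanged. The helper names and the tag-detection heuristic are assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# Hypothetical heuristic: a value counts as HTML if it contains something tag-like.
_TAG_RE = re.compile(r"<[a-zA-Z!/][^>]*>")


def to_plain_text(value: str) -> str:
    """Strip markup from HTML values; return other values unchanged."""
    if not _TAG_RE.search(value):
        return value
    collector = _TextCollector()
    collector.feed(value)
    collector.close()
    # Join the text nodes and normalise runs of whitespace to single spaces.
    return " ".join(" ".join(collector.parts).split())


html_row = "<html><body><p>Summer holiday</p><p>starts in July</p></body></html>"
text_row = "Term dates 2012: a < b"

cleaned = [to_plain_text(html_row), to_plain_text(text_row)]
```

A pre-processing step along these lines could be applied before the text reaches the token pipeline, so that HTML-specific handling only touches the rows that need it.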
johnyma22
Newbie

Posts: 18
« Reply #21 on: February 03, 2012, 02:43:07 PM »
It contains both HTML documents and Text extracted from PDF documents.
haddock
« Reply #22 on: February 03, 2012, 03:19:29 PM »
Your current XML sets the label role on the training set but not on the test set. I refer you to earlier posts in this thread from Ingo and to the help for this operator: "Please pay attention to the fact that the application of models will need the same attributes during application on an ExampleSet that were part of the ExampleSet it was trained on. Some minor changes like adding attributes might be possible, but might cause severe calculation errors. Please make sure that the attributes' number, order, type and role are consistent during training and application."
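The consistency requirement from the operator help can be sketched as a simple schema comparison. This is plain Python for illustration only, not something RapidMiner exposes; the Attribute type, the function name and the sample schemas are all hypothetical.

```python
from typing import List, NamedTuple


class Attribute(NamedTuple):
    name: str
    value_type: str
    role: str


def schema_mismatches(train: List[Attribute], test: List[Attribute]) -> List[str]:
    """Report differences in attribute number, order, type and role."""
    problems = []
    if len(train) != len(test):
        problems.append(f"attribute count: {len(train)} vs {len(test)}")
    # Comparing position by position also catches differences in order.
    for position, (t, a) in enumerate(zip(train, test)):
        if t.name != a.name:
            problems.append(f"position {position}: name {t.name!r} vs {a.name!r}")
        elif t.value_type != a.value_type:
            problems.append(f"{t.name}: type {t.value_type!r} vs {a.value_type!r}")
        elif t.role != a.role:
            problems.append(f"{t.name}: role {t.role!r} vs {a.role!r}")
    return problems


# Hypothetical schemas: the role of "label" was set only on the training side.
train_schema = [
    Attribute("label", "nominal", "label"),
    Attribute("holiday", "numeric", "regular"),
    Attribute("school", "numeric", "regular"),
]
test_schema = [
    Attribute("label", "nominal", "regular"),
    Attribute("holiday", "numeric", "regular"),
    Attribute("school", "numeric", "regular"),
]

problems = schema_mismatches(train_schema, test_schema)
```

An empty result would mean the two sets agree on number, order, type and role; here the only entry points at the role of the label attribute.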
Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?
T.S.Eliot ~ Choruses from the Rock 1934
johnyma22
Newbie

Posts: 18
« Reply #23 on: February 03, 2012, 08:08:22 PM »
Okay, great, I have some results now. Thanks, guys. I think it might be helpful for other people if I keep this thread open for my questions about how to interpret the data.
haddock
« Reply #24 on: February 03, 2012, 08:26:32 PM »
I disagree.
johnyma22
Newbie

Posts: 18
« Reply #25 on: February 03, 2012, 08:46:51 PM »
Fair enough.. I will go ahead and fragment the learning process for people in the future.