johnyma22
Newbie

Posts: 18
« on: January 22, 2012, 07:56:18 PM »
My first Read Database gets all of the values from the documents (20k). My second Read Database (1k documents) has a value isGood = 1 if the value is good, -2 if the value is bad, and a bunch of other really bad ideas. I set isGood to label. Should I actually only be passing true/false, or is an integer okay?

I use Nominal to Text to get the "data" field as text. I then process the document, looking for word frequencies etc. Is my Naive Bayes even in the right place?

My end goal is that I feed it 1000 known good documents and it can find very similar documents from the first Read Database. I want my confidence score to be based on document similarity. I am getting an output that contains confidence, but I'm not sure how to present my output. I don't come from a statistical background, so I'm learning on my feet. I appreciate I have a lot to learn, so in 3 weeks' time I'm going to read some books/content about how to use RapidMiner and ML in general. I can only apologize for my ignorance!

TL;DR:
- Can I use an integer as a label?
- Am I using Naive Bayes and Apply Model correctly?
- How can I view my data in an easy-to-interpret way? Ideally something like a list of document IDs with their confidence rating.

Thanks guys!
« Last Edit: February 03, 2012, 08:47:06 PM by johnyma22 »
misanthropic789
Newbie

Posts: 8
« Reply #1 on: January 22, 2012, 08:48:51 PM »
If I understand what you are saying, you have your reads backwards. Your 1000 records are your training data set; that is what you know is correct and what you want to use to define the model. You should read that in first and feed it to the Naive Bayes process. Your full data set is what you want to score/apply the model to, so you should feed that in second and then apply the model to it. That will give you a score for each record in your full data set that tells you both whether the Naive Bayes model thinks the record is good or not and the level of confidence (0-1) that it has in that assessment.
As far as your variables go, I don't think there is a technical reason why you can't use integers; however, the spread of your values is odd. I would use 1 and 0 (1 is good, 0 is not good) if I were using integers. Someone else will need to say whether there needs to be a Numerical to Nominal process in there on your label. That is how my job is set up.
Regarding output, what you need to do is save the output of Apply Model, either to a CSV file or to the repository. Then you can extract the fields you need from it (ID and prediction(yes)).
BTW, I'm one step less of a newbie than you are, so I hope others will jump in and correct both of us. However, I am sure about your reads being backwards, so you should start by fixing that.
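To make the train-first, apply-second order concrete, here is a minimal sketch in plain Python rather than RapidMiner. It is only an illustration under assumptions: the `train` and `predict` functions, the toy documents, and the IDs are all made up, and the classifier is a simple multinomial naive Bayes with add-one smoothing, not RapidMiner's operator. It learns from a small labeled set, scores an unlabeled set, and prints a list of document IDs with the predicted label and the confidence for "true".

```python
import math
from collections import Counter

def tokenize(text):
    # Lower-case, split on whitespace, drop one-character tokens.
    return [w for w in text.lower().split() if len(w) >= 2]

def train(labeled_docs):
    """Learn word counts per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for w in tokenize(text):
            counts[w] += 1
            vocab.add(w)
    if len(class_counts) < 2:
        # A training set with a single label value gives nothing to separate.
        raise ValueError("training data needs at least two label values")
    return {"classes": class_counts, "words": word_counts, "vocab": vocab}

def predict(model, text):
    """Return (predicted_label, {label: confidence}); confidences sum to 1."""
    total_docs = sum(model["classes"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for label, n_docs in model["classes"].items():
        counts = model["words"][label]
        n_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for w in tokenize(text):
            if w in model["vocab"]:  # words unseen in training are ignored
                score += math.log((counts[w] + 1) / (n_words + vocab_size))
        log_scores[label] = score
    # Normalise the log scores into confidences between 0 and 1.
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp.values())
    conf = {label: e / norm for label, e in exp.items()}
    return max(conf, key=conf.get), conf

# Step 1: the known, labeled documents define the model (hypothetical data).
training = [
    ("school holiday dates summer term", "true"),
    ("term dates and holiday calendar", "true"),
    ("cheap pills buy now", "false"),
    ("buy cheap watches now", "false"),
]
model = train(training)

# Step 2: the model is applied to the unlabeled documents.
unlabeled = {101: "holiday dates for autumn term", 102: "buy pills now"}
results = {doc_id: predict(model, text) for doc_id, text in unlabeled.items()}

for doc_id, (label, conf) in results.items():
    print(doc_id, label, round(conf["true"], 3))
```

The confidence here is the model's normalised class probability, which is not the same thing as a document similarity score; documents that share many words with the "true" examples simply tend to get a higher value.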
johnyma22
Newbie

Posts: 18
« Reply #2 on: January 23, 2012, 04:13:14 AM »
Now when I attempt to run the process I get:
"The learning scheme naive bayes does not have sufficient capabilities for handling an example set with only one label"
But "include special attributes" is ticked, and so are "keep text" and "add meta information". Any idea what I could be doing wrong?
Marius
« Reply #3 on: January 23, 2012, 09:26:31 AM »
Hi,
some additions from my side:
- Are you sure that your training data contains more than one value for isGood? If it contains only examples of one class, that could cause the error message.
- For Text Processing it is very important to use the same word list for training and application. Thus you have to connect the "wor" (word list) output of the Process Documents operator in the training branch to the "wor" input in the application branch. That way it is guaranteed that the training and application example sets contain the same word vectors.
- Do your integer values in isGood imply an order, or are they actually categories? In the latter case you should convert the label to a nominal value, so Naive Bayes will perform a classification. If it is left as Integer, it will perform a regression.
Best, Marius
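The word list point can be illustrated outside RapidMiner with a small plain-Python sketch. The function names `build_word_list` and `to_vector` and the sample texts are hypothetical; the only idea taken from the post is that the application data must be vectorised against the word list built from the training data.

```python
from collections import Counter

def build_word_list(texts):
    """Sorted list of the distinct lower-cased tokens seen in `texts`."""
    return sorted({w for t in texts for w in t.lower().split()})

def to_vector(text, word_list):
    """Count each word-list entry in `text`; words outside the list are dropped."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in word_list]

train_texts = ["holiday dates term", "buy cheap pills"]
apply_texts = ["autumn term holiday", "pills for sale"]

# Word list learned from the training branch.
train_words = build_word_list(train_texts)

# What an unconnected application branch would build on its own:
# different words, so different columns than the model was trained on.
separate_words = build_word_list(apply_texts)

# Reusing the training word list keeps the columns identical in both sets.
train_vectors = [to_vector(t, train_words) for t in train_texts]
apply_vectors = [to_vector(t, train_words) for t in apply_texts]

print(train_words)
print(separate_words)
print(apply_vectors)
```

With separate word lists the two example sets have different attributes, so a model trained on one cannot be meaningfully applied to the other; with the shared list every vector has one entry per training word.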
« Last Edit: January 25, 2012, 10:44:22 AM by Marius »
Please add [SOLVED] to the topic title when your problem has been solved! (do so by editing the first post in the thread and modifying the title) Please click here before posting.
johnyma22
Newbie

Posts: 18
« Reply #4 on: January 24, 2012, 08:44:55 PM »
Hey guys, so I made some progress. I extended my DB structure to support a label field and set any known positive matches to true and any known negatives to false. I use these MySQL select queries:

SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = "true" OR isGood = -2 AND school_list_holiday_sources.label = "false" LIMIT 0,50

This select gets the items with a label of true or false. Naive Bayes learns from these.

SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != 1 AND isGood = 0 ORDER BY score desc LIMIT 0,10

This select gets all of the items that don't have a true or false label.

My output data doesn't have any confidence rating. Should it? It looks like this: (screenshot not preserved)

Thanks! PS: if someone could add me on Skype or another IM service, I'd be happy to screen share and work on this in real time.
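One thing worth checking in the first query is how the unparenthesised `OR ... AND ...` is grouped: in SQL, AND binds more tightly than OR. The sketch below shows the grouping using Python's built-in sqlite3 module with a made-up `docs` table and made-up rows; it is not the poster's MySQL schema, and it uses single-quoted string literals. MySQL applies the same AND-before-OR precedence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER, label TEXT, isGood INTEGER)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [(1, "true", 1), (2, "false", -2), (3, "false", 0), (4, "true", -2)],
)

def ids(where):
    """Return the ids matching a WHERE clause (hypothetical helper for this demo)."""
    query = f"SELECT id FROM docs WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(query)]

# Same shape as the training query in the post.
as_written = ids("label = 'true' OR isGood = -2 AND label = 'false'")

# The two possible explicit groupings.
and_first = ids("label = 'true' OR (isGood = -2 AND label = 'false')")
or_first = ids("(label = 'true' OR isGood = -2) AND label = 'false'")

print(as_written)  # [1, 2, 4]
print(and_first)   # [1, 2, 4]
print(or_first)    # [2]
```

So the query as written selects every row labeled true plus the rows that are both isGood = -2 and labeled false. If that is the intent, adding the parentheses explicitly makes it obvious to the next reader.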
Marius
« Reply #5 on: January 25, 2012, 10:47:03 AM »
Hi,
if you applied the classification model: yes, your output should contain predictions and confidences. It would be helpful if you posted your process as XML here, so we can check the setup. You get the XML code via the XML tab at the top of the process view in RapidMiner. Just copy the text from there into your next answer, and please use the #-button on top of the input box for that.
Best, Marius
johnyma22
Newbie

Posts: 18
« Reply #6 on: January 25, 2012, 04:15:28 PM »
Here ya go

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.014">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
    <process expanded="true" height="369" width="835">
      <operator activated="true" class="read_database" compatibility="5.1.014" expanded="true" height="60" name="Read Database" width="90" x="45" y="210">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != &quot;true&quot; AND isGood = 0 ORDER BY score desc LIMIT 0,10"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.1.014" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="75">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = &quot;true&quot; OR isGood = -2 AND school_list_holiday_sources.label = &quot;false&quot; LIMIT 0,50"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.1.014" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="75">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="75">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true" height="480" width="815">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="90"/>
          <portSpacing port="sink_document 1" spacing="90"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.1.014" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
          <connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.1.014" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="75">
        <parameter key="name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="naive_bayes" compatibility="5.1.014" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="75">
        <parameter key="laplace_correction" value="false"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.1.014" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="210">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
      <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="90"/>
    </process>
  </operator>
</process>

Editable here http://beta.etherpad.org/p/rapidminer
Ingo Mierswa
« Reply #7 on: January 25, 2012, 04:39:52 PM »
Hi there, please try this one and let me know if it works:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
    <process expanded="true" height="369" width="835">
      <operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="30">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = &quot;true&quot; OR isGood = -2 AND school_list_holiday_sources.label = &quot;false&quot; LIMIT 0,50"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true" height="480" width="815">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="90"/>
          <portSpacing port="sink_document 1" spacing="90"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.1.017" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="30">
        <parameter key="name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="naive_bayes" compatibility="5.1.017" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="30">
        <parameter key="laplace_correction" value="false"/>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database" width="90" x="45" y="210">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != &quot;true&quot; AND isGood = 0 ORDER BY score desc LIMIT 0,10"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
          <connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="210">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
      <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="180"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Cheers, Ingo
Did you try our new Marketplace? Upload or download new Extensions, add comments, and organize your operators. Have a look at http://marketplace.rapid-i.com
johnyma22
Newbie

Posts: 18
« Reply #8 on: January 25, 2012, 04:43:45 PM »
Exactly the same output, no confidence etc.
Ingo Mierswa
« Reply #9 on: January 25, 2012, 05:03:53 PM »
Hi, are you sure that you have pressed the green check icon after inserting the XML (I frequently forget this  ). The difference is really small: I just have connected the output port with the word list of the first operator for text processing with the input port for the word list of the second one. This is definitely necessary, since otherwise the resulting example sets would differ and a prediction is not possible then. This should actually also be stated in the log, by the way. Another thing which cames into my mind is the fact that your query delivers an attribute label, which get the role "label" during training but not during testing. Remove this or also set the role to label before model application. Here is the suggested process: <?xml version="1.0" encoding="UTF-8" standalone="no"?> <process version="5.1.017"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process"> <process expanded="true" height="369" width="835"> <operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="30"> <parameter key="connection" value="slave2"/> <parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = "true" OR isGood = -2 AND school_list_holiday_sources.label = "false" LIMIT 0,50"/> <enumeration key="parameters"/> </operator> <operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="data"/> <parameter key="include_special_attributes" value="true"/> </operator> <operator activated="true" 
class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30"> <parameter key="keep_text" value="true"/> <parameter key="prune_method" value="absolute"/> <parameter key="prune_below_absolute" value="2"/> <parameter key="prune_above_absolute" value="999"/> <list key="specify_weights"/> <process expanded="true" height="480" width="815"> <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/> <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/> <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/> <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/> <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/> <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120"> <parameter key="min_chars" value="2"/> </operator> <connect from_port="document" to_op="Extract Content" to_port="document"/> <connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/> <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/> <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/> <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/> <connect from_op="Stem (Snowball)" 
from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/> <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="90"/> <portSpacing port="sink_document 1" spacing="90"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="set_role" compatibility="5.1.017" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="30"> <parameter key="name" value="label"/> <parameter key="target_role" value="label"/> <list key="set_additional_roles"/> </operator> <operator activated="true" class="naive_bayes" compatibility="5.1.017" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="30"> <parameter key="laplace_correction" value="false"/> </operator> <operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database" width="90" x="45" y="210"> <parameter key="connection" value="slave2"/> <parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != "true" AND isGood = 0 ORDER BY score desc LIMIT 0,10"/> <enumeration key="parameters"/> </operator> <operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="data"/> <parameter key="include_special_attributes" value="true"/> </operator> <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210"> <parameter key="keep_text" value="true"/> <parameter key="prune_method" value="absolute"/> <parameter key="prune_below_absolute" 
value="2"/> <parameter key="prune_above_absolute" value="999"/> <list key="specify_weights"/> <process expanded="true"> <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/> <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/> <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/> <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/> <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/> <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)"> <parameter key="min_chars" value="2"/> </operator> <connect from_port="document" to_op="Extract Content (2)" to_port="document"/> <connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/> <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/> <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/> <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/> <connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/> <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="set_role" compatibility="5.1.017" expanded="true" height="76" name="Set Role (3)" width="90" x="447" y="210"> <parameter key="name" value="label"/> <parameter key="target_role" value="label"/> <list 
key="set_additional_roles"/> </operator> <operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="120"> <list key="application_parameters"/> </operator> <connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/> <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/> <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role (2)" to_port="example set input"/> <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/> <connect from_op="Set Role (2)" from_port="example set output" to_op="Naive Bayes" to_port="training set"/> <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/> <connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/> <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/> <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Set Role (3)" to_port="example set input"/> <connect from_op="Set Role (3)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/> <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="90"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
If these things are not the reason, I am afraid I would have to look into the data and the transformed data, i.e. the two example sets which are actually delivered to the learner and to Apply Model: do they really contain regular attributes? Are those the same for training and testing?
Cheers, Ingo
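The comparison asked for above (do both example sets contain regular attributes, and are they the same for training and testing?) can be sketched outside RapidMiner roughly as follows. This is only an illustration in Python, not part of the posted process: it assumes the column names and roles of both example sets have been copied out by hand, and `training_roles`, `scoring_roles` and both helper functions are hypothetical names.

```python
# Hypothetical stand-ins for the two example sets: column name -> role.
# "regular" marks ordinary attributes; "label" and "id" are special roles.
training_roles = {"free": "regular", "offer": "regular", "meet": "regular", "label": "label"}
scoring_roles = {"free": "regular", "offer": "regular", "click": "regular", "id": "id"}


def regular_attributes(roles):
    """Return the set of column names whose role is 'regular'."""
    return {name for name, role in roles.items() if role == "regular"}


def compare_attribute_sets(train_roles, score_roles):
    """Report which regular attributes are shared and which appear on one side only."""
    train = regular_attributes(train_roles)
    score = regular_attributes(score_roles)
    return {
        "shared": sorted(train & score),
        "only_in_training": sorted(train - score),
        "only_in_scoring": sorted(score - train),
    }


report = compare_attribute_sets(training_roles, scoring_roles)
for key, names in report.items():
    print(f"{key}: {names}")
```

If either one-sided list is non-empty, or the shared list is empty, the model is being applied to attributes it was not trained on, which is the situation the questions above are trying to rule out.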
Did you try our new Marketplace? Upload or download new Extensions, add comments, and organize your operators. Have a look at http://marketplace.rapid-i.com
johnyma22
Newbie

Posts: 18
|
 |
« Reply #10 on: January 25, 2012, 05:08:01 PM » |
|
I didn't check the green check, however when I did I still get the same output.
If you want I can share my screen via skype and we can make modifications in real time?
Skype: johny_mac
Ingo Mierswa
|
 |
« Reply #11 on: January 25, 2012, 05:18:08 PM » |
|
Hi,
"I didn't check the green check, however when I did I still get the same output."
Even with the second process with the additional Set Role operator? Weird. Now we do indeed have to inspect the data delivered to the learner and to the Apply Model operator (see questions below).
"If you want I can share my screen via skype and we can make modifications in real time?"
Yeah, sounds like fun, but I am out of office and surfing only via my mobile phone. And usually I charge 200 Euro per hour for this type of consulting (but be assured: we have some junior consultants who are less expensive). But let's face it: although we usually do not work on a per-hour basis, at some point this might indeed be the most time-efficient thing to do if you or others here do not find the reason...
If somebody else has more time and wants to dive deeper into this: the next thing I would check is what is delivered to the learner (see my questions below) and to the operator Apply Model, together with the log messages. If the dimension is really high, another learner might also be more appropriate.
Just my 2c.
Cheers, Ingo
johnyma22
Newbie

Posts: 18
|
 |
« Reply #12 on: January 25, 2012, 07:27:13 PM » |
|
I'm happy to PayPal some cash over. I would expect it is only a five-minute job, as my task is so simple and I'm probably only missing a checkbox somewhere! Would anyone be willing to do it as a side job and charge not the 200 euros per hour but maybe 20 euros for 5 minutes of your time? Or maybe I can donate some money to charity or to your favorite open source project?
johnyma22
Newbie

Posts: 18
|
 |
« Reply #13 on: February 01, 2012, 07:56:05 PM » |
|
Just a note: I'm still stuck with this. I'm sure the process is working fine, but I'm not able to interpret the results correctly.
Marius
|
 |
« Reply #14 on: February 02, 2012, 07:36:37 PM » |
|
Hi,
as Ingo said above: please check your data, and also your SQL queries. It seems a bit odd to me that you said you want to use isGood as the label but are fetching a label column from the database. Also, in your screenshot of the data, the columns for label and isGood are almost empty. Please check that you are fetching the correct data sets by putting a breakpoint on the Read Database operators.
Best, Marius
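What to look for at that breakpoint can be sketched as a small check on the fetched rows. The snippet below is a hedged illustration in Python, not RapidMiner code: the `rows` list is made-up sample data standing in for the result of the training query, and `summarize_column` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical rows as they might come back from the training query.
rows = [
    {"id": 1, "isGood": 1, "label": None},
    {"id": 2, "isGood": -2, "label": None},
    {"id": 3, "isGood": None, "label": None},
    {"id": 4, "isGood": 1, "label": ""},
]


def summarize_column(rows, column):
    """Count missing values and tally the remaining ones for one column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None and v != ""]
    return {
        "missing": len(values) - len(present),
        "counts": Counter(present),
    }


for column in ("isGood", "label"):
    summary = summarize_column(rows, column)
    print(column, "missing:", summary["missing"], "values:", dict(summary["counts"]))
```

A column that is entirely missing, or that holds only one distinct value, gives the learner nothing to separate good documents from bad ones, so it should not be the one assigned the label role.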
Please add [SOLVED] to the topic title when your problem has been solved! (do so by editing the first post in the thread and modifying the title) Please click here before posting.