Author Topic: [SOLVED] Really basic question, I think I'm applying models wrong.  (Read 609 times)
johnyma22
Newbie
*
Posts: 18


« Reply #15 on: February 02, 2012, 11:57:18 PM »

As requested, I extended my database schema with a field named label that holds "true" / "false".

This is what I now get in my results, which I think means something is working right:



Can anyone please confirm?

Thanks
Marius
Global Moderator
Sr. Member
*****
Posts: 370



« Reply #16 on: February 03, 2012, 12:19:47 PM »

At least it does not look wrong. This is the Naive Bayes model created by the Naive Bayes operator. The model's distribution table would be more interesting (access it via the radio buttons in the results view). But you are probably more interested in the labelled result set, so you have to connect the "lab" (labelled data) output of Apply Model to a result output. Anyway, the process Ingo posted should work.

If you still don't get valid results, again check the following:

Did you:
- connect the word list output of Process Documents in the training branch to the word list input of Process Documents in the apply branch?
- double-check that you read the correct data from both Read Database operators?
- stop retrieving isGood from the database if you don't use it?
- find out why the label attribute is empty after Process Documents, and try to fix it? Is it already empty directly after the Read Database operators?
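
For reference, the two connections from this checklist might look roughly like the fragment below in the process XML. This is only a sketch: the operator names are taken from the process posted in Reply #19 and will differ if the operators are named otherwise.

```xml
<!-- Sketch only: operator names copied from the process in Reply #19. -->
<!-- Reuse the training word list when vectorizing the data to be classified. -->
<connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
<!-- Deliver the labelled example set (the "lab" port) to the results view. -->
<connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
```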

Best, Marius

Please add [SOLVED] to the topic title when your problem has been solved! (do so by editing the first post in the thread and modifying the title)
johnyma22
Newbie
*
Posts: 18


« Reply #17 on: February 03, 2012, 12:44:59 PM »

The distribution model looks like this:


In answer to your questions:
Yes.
Yes.
Removed isGood

I'm running some more tests now, will reply once they are completed.

Thanks :)
Marius
Global Moderator
Sr. Member
*****
Posts: 370



« Reply #18 on: February 03, 2012, 01:30:48 PM »

My guess is that you changed your SQL statement and no longer fetch a "data" attribute with the text; instead, your text attributes are now called "Title" and "Description". The Nominal to Text operators therefore have to be adapted so that they operate on the two new attributes rather than on "data". If you have only text attributes and the label, you could set the attribute filter type to "all" and uncheck "include special attributes".
Didn't you get a warning or error in the "Problems" view at the bottom of RapidMiner saying something like "The example set must contain at least one text attribute"?
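
As a rough illustration of the filter-type suggestion above, a Nominal to Text operator configured that way might look like the fragment below. It is a sketch under the stated assumption that the example set holds only text attributes plus the label; the operator name is borrowed from the process posted in Reply #19.

```xml
<!-- Sketch only: convert every regular (non-special) nominal attribute to text. -->
<!-- Assumes the example set contains just text attributes plus the label. -->
<operator activated="true" class="nominal_to_text" compatibility="5.2.000" expanded="true" name="Nominal to Text">
  <parameter key="attribute_filter_type" value="all"/>
  <parameter key="include_special_attributes" value="false"/>
</operator>
```

Note that "include special attributes" only spares the label if the label role has already been assigned at this point. In the process from Reply #19, Set Role runs after Process Documents from Data, so label would still be a regular attribute here and would likely be converted along with the text columns.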

Best, Marius
johnyma22
Newbie
*
Posts: 18


« Reply #19 on: February 03, 2012, 02:20:39 PM »

The SQL statements only fetch "data" for the text.

"Include special attributes" is not checked.

I didn't get any warnings.

View:


XML is this:

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.000" expanded="true" name="Process">
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true" height="386" width="835">
      <operator activated="true" class="read_database" compatibility="5.2.000" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="30">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT label, data, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = &quot;true&quot; OR isGood = -2 AND school_list_holiday_sources.label = &quot;false&quot; LIMIT 0,100"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.000" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true" height="480" width="815">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="90"/>
          <portSpacing port="sink_document 1" spacing="90"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.2.000" expanded="true" height="60" name="Read Database" width="90" x="45" y="210">
        <parameter key="connection" value="slave2"/>
        <parameter key="query" value="SELECT data, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != &quot;true&quot; AND isGood = 0 LIMIT 0,100"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.000" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="data"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)">
            <parameter key="min_chars" value="2"/>
          </operator>
          <connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
          <connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.2.000" expanded="true" height="76" name="Set Role" width="90" x="447" y="30">
        <parameter key="name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="naive_bayes" compatibility="5.2.000" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="30">
        <parameter key="laplace_correction" value="false"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="120">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
      <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
      <connect from_op="Apply Model (2)" from_port="model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="90"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Marius
Global Moderator
Sr. Member
*****
Posts: 370



« Reply #20 on: February 03, 2012, 02:40:35 PM »

You are using Extract Content, which is for HTML documents only. Does your database contain HTML documents? What happens if you remove those operators?
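
To try that experiment, the inner subprocess of Process Documents from Data could be reduced to something like the sketch below, with Extract Content removed and the document port wired straight into Transform Cases. The operator names and classes are those of the posted process; layout attributes are omitted for brevity.

```xml
<!-- Sketch only: inner subprocess without Extract Content, so plain text is not treated as HTML. -->
<process expanded="true">
  <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases"/>
  <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize"/>
  <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (English)"/>
  <operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (Snowball)"/>
  <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (by Length)">
    <parameter key="min_chars" value="2"/>
  </operator>
  <!-- The incoming document now goes directly to Transform Cases. -->
  <connect from_port="document" to_op="Transform Cases" to_port="document"/>
  <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
  <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
  <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
  <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
  <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
</process>
```

The same change would have to be made in both Process Documents operators so that the training and application data are preprocessed identically.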
johnyma22
Newbie
*
Posts: 18


« Reply #21 on: February 03, 2012, 02:43:07 PM »

It contains both HTML documents and text extracted from PDF documents.
haddock
Hero Member
*****
Posts: 759



« Reply #22 on: February 03, 2012, 03:19:29 PM »

Your current XML does not set the label role on the test set, but it does on the training set.

I refer you to the earlier posts from Ingo in this thread and to the help text for the Apply Model operator:

Quote
Please pay attention to the fact, that the application of Models will need the same attributes during application on an ExampleSet that where part of the ExampleSet it was trained on. Some minor changes like adding attributes might be possible, but might cause severe calculation errors. Please make sure, that the attributes' number, order, type and role are consistent during training and application.
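
One way to make the roles consistent might be to mirror the training branch's Set Role on the application branch, as sketched below. The name "Set Role (2)" is hypothetical, and the sketch assumes the second Read Database query is changed to also select label, which the posted query does not do.

```xml
<!-- Sketch only: "Set Role (2)" is a hypothetical operator name. -->
<!-- Assumes the second Read Database query is extended to also SELECT label. -->
<operator activated="true" class="set_role" compatibility="5.2.000" expanded="true" name="Set Role (2)">
  <parameter key="name" value="label"/>
  <parameter key="target_role" value="label"/>
  <list key="set_additional_roles"/>
</operator>
<!-- These two connections replace the direct link from Process Documents from Data (2) to Apply Model (2). -->
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
```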

Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

T.S.Eliot ~ Choruses from the Rock 1934
johnyma22
Newbie
*
Posts: 18


« Reply #23 on: February 03, 2012, 08:08:22 PM »

Okay, great, I have some results now :) Thanks guys. I think it might be helpful for other people if I keep this thread open for my questions about how to interpret the data.
haddock
Hero Member
*****
Posts: 759



« Reply #24 on: February 03, 2012, 08:26:32 PM »

I disagree.
johnyma22
Newbie
*
Posts: 18


« Reply #25 on: February 03, 2012, 08:46:51 PM »

Fair enough. I will go ahead and fragment the learning process for people in the future.