Author Topic: [SOLVED] Issue with ID field (randomly edited?)  (Read 394 times)
ighyboo
Newbie
*
Posts: 25


« on: December 02, 2013, 06:46:23 PM »

Dear all,
I'm having a problem with a process I've developed for Twitter follower analysis.

Roughly, the process:
- reads Twitter follower data (i.e. Twitter ID, description, location, #followers, etc.) from a repository or from a file
- filters out examples with a missing text description

At this point one subprocess branches out and does this:
Input: "text description", ID -> Create Vectors TF-IDF -> Cluster them into 5 groups -> Output: ID, cluster

The output of this subprocess is joined to the original dataset with an inner join,
and this is where my issue starts.

After the join, only ~2k IDs out of the ~12k original ones match.

I created some breakpoints and compared the IDs along the way: the IDs coming out of the subprocess are different from the original ones (although the number of records/examples is the same). The differences look pretty random, with no evident pattern.
The IDs are integers from the original dataset and are not consecutive.
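
For anyone who wants to run the same kind of check outside RapidMiner, here is a minimal Python sketch of the comparison. It assumes both ID columns have been exported as lists in the same row order; the function name `compare_ids` and the sample values are hypothetical and not taken from the real dataset.

```python
def compare_ids(original_ids, branch_ids):
    """Compare two ID lists that are assumed to be in the same row order.

    Returns the number of unchanged IDs and a list of (before, after) pairs
    for the IDs that differ.
    """
    if len(original_ids) != len(branch_ids):
        raise ValueError("ID lists differ in length")
    changed = [(o, b) for o, b in zip(original_ids, branch_ids) if o != b]
    matching = len(original_ids) - len(changed)
    return matching, changed


# Hypothetical sample values, for illustration only.
original = [12345, 1234567891, 987654321]
from_branch = [12345, 1234567936, 987654336]

matching, changed = compare_ids(original, from_branch)
print(matching, "unchanged;", len(changed), "changed")
for before, after in changed:
    # Looking at the size of each difference can hint at a rounding problem.
    print(before, "->", after, "difference:", after - before)
```

Listing the changed pairs with their differences makes it easier to see whether small IDs survive while large ones drift.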

Any ideas?

Here is the XML from the subprocess:
Code:
<operator activated="true" class="subprocess" compatibility="5.3.015" expanded="true" height="94" name="Cluster" width="90" x="179" y="30">
        <process expanded="true">
          <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes (2)" width="90" x="45" y="120">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value="description"/>
            <parameter key="attributes" value="|id|description"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="text:data_to_documents" compatibility="5.3.002" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="30">
            <list key="specify_weights">
              <parameter key="description" value="1.0"/>
            </list>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
            <parameter key="prune_method" value="percentual"/>
            <parameter key="prune_below_percent" value="0.2"/>
            <parameter key="prune_above_percent" value="99.99"/>
            <parameter key="datamanagement" value="float_array"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="112" y="120"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="210"/>
              <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (6)" width="90" x="246" y="300">
                <parameter key="min_chars" value="3"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.3.002" expanded="true" height="76" name="Filter Stopwords (Dictionary)" width="90" x="380" y="165">
                <parameter key="file" value="C:\Users\MenghinI\Documents\Twitter\stopwords.txt"/>
              </operator>
              <operator activated="true" class="text:stem_snowball" compatibility="5.3.002" expanded="true" height="60" name="Stem (Snowball)" width="90" x="514" y="75"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (6)" to_port="document"/>
              <connect from_op="Filter Tokens (6)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
              <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
              <connect from_op="Stem (Snowball)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="fast_k_means" compatibility="5.3.015" expanded="true" height="76" name="Clustering" width="90" x="447" y="30">
            <parameter key="k" value="5"/>
            <parameter key="max_runs" value="1"/>
            <parameter key="max_optimization_steps" value="1"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="120">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="cluster"/>
          </operator>
          <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join" width="90" x="447" y="210">
            <parameter key="remove_double_attributes" value="false"/>
            <list key="key_attributes"/>
          </operator>
          <connect from_port="in 1" to_op="Select Attributes (2)" to_port="example set input"/>
          <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Select Attributes (2)" from_port="original" to_op="Join" to_port="right"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="out 1"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Join" to_port="left"/>
          <connect from_op="Join" from_port="join" to_port="out 2"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
          <portSpacing port="sink_out 3" spacing="0"/>
        </process>
      </operator>
      

Any help would be greatly appreciated :)

Igor
« Last Edit: December 04, 2013, 05:17:47 PM by ighyboo »
awchisholm
Sr. Member
****
Posts: 393


« Reply #1 on: December 02, 2013, 11:31:19 PM »

Hello Igor

Mission Impossible without the full process - almost...

What happens to the IDs before and after the k-means operator?

regards

Andrew

ighyboo
Newbie
*
Posts: 25


« Reply #2 on: December 03, 2013, 12:30:19 AM »

Dear Andrew,
thanks a lot for the prompt reply.

I identified the problem, but I wonder whether it's normal behaviour or a bug (I'm not the best judge, as I don't know enough about the product).

Basically, the problem was caused by the "Process Documents" operator, in particular its "datamanagement" parameter.
If the parameter is set to "double_array", everything works fine: the IDs are preserved and the inner join returns all the records.

If the parameter is set to "float_array" (which I set by mistake), the IDs get changed, and when I inner join the two branches of the process only some IDs match the original ones.

I doubt that's normal behaviour. Maybe it's a side effect of storing long IDs as floats, or of some rounding function? I don't know. Has anyone else experienced this?
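
The rounding suspicion can be checked outside RapidMiner. Below is a minimal Python sketch, assuming that "float_array" stores each value as a 32-bit IEEE 754 float (the name suggests this, but nothing in this thread confirms it). A 32-bit float has a 24-bit significand, so it can hold every integer exactly only up to 2^24 = 16,777,216; larger integers are rounded to the nearest representable value. The helper name `through_float32` and the sample IDs are hypothetical.

```python
import struct


def through_float32(value):
    """Round-trip an integer through a 32-bit IEEE 754 float."""
    packed = struct.pack("<f", value)          # store as 4-byte float
    return int(struct.unpack("<f", packed)[0])  # read it back as an integer


# Every integer up to this limit is exactly representable in a 32-bit float.
FLOAT32_EXACT_LIMIT = 2 ** 24

# Hypothetical IDs, for illustration only.
for user_id in (12345, 16777216, 16777217, 1234567891):
    stored = through_float32(user_id)
    status = "preserved" if stored == user_id else "changed"
    print(user_id, "->", stored, status)
```

If that assumption holds, it would be consistent with what I see: small IDs, and larger ones that happen to be exactly representable, still match after the join, while the rest come back slightly altered with no obvious pattern.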

If you need the original dataset to recreate the issue let me know.

Kind regards,
Igor


P.S.
I tried to post the full code of the process, but it's too long for the forum.