Topic: Parsing XML
jshiller
« on: December 06, 2010, 05:09:35 AM »

I've been experimenting with the REST API from LastFM. My query to the API asks for artists similar to Bono.

Here's the XML file that the query generates:
http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&artist=bono&api_key=b25b959554ed76058ac220b7b2e0a026

I'm trying to parse the XML file and generate output that provides the "artist" and "match" values for each of the 100 entries in the XML file. The current output instead contains 200 rows, each holding the URL I'm querying, the full contents of the page, and the names of the attributes I set up with XPath queries. The output I want is a different artist name and its associated match number on each row. Any advice on how to achieve this is greatly appreciated.

Thanks,
Jamie

This is what I want to see in the Data View: [screenshot missing]

This is what I currently see in the Data View: [screenshot missing]

Here's my process:

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
    <process expanded="true" height="628" width="736">
      <operator activated="true" class="web:process_web" compatibility="5.0.4" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="30">
        <parameter key="url" value="http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&amp;artist=bono&amp;api_key=b25b959554ed76058ac220b7b2e0a026"/>
        <list key="crawling_rules">
          <parameter key="0" value="http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&amp;artist=bono&amp;api_key=b25b959554ed76058ac220b7b2e0a026"/>
        </list>
        <parameter key="add_pages_as_attribute" value="true"/>
        <parameter key="max_pages" value="1"/>
        <process expanded="true" height="481" width="788">
          <operator activated="true" class="text:cut_document" compatibility="5.0.7" expanded="true" height="60" name="Cut Document" width="90" x="70" y="46">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="name" value="/h:lfm/h:similarartists/h:artist/h:name"/>
              <parameter key="match" value="/h:lfm/h:similarartists/h:artist/h:match"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true" height="463" width="702">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="write_database" compatibility="5.0.8" expanded="true" height="60" name="Write Database" width="90" x="246" y="30">
        <parameter key="connection" value="AWS RDS"/>
        <parameter key="table_name" value="artists"/>
        <parameter key="overwrite_mode" value="append"/>
      </operator>
      <connect from_op="Process Documents from Web" from_port="example set" to_op="Write Database" to_port="input"/>
      <connect from_op="Write Database" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
« Last Edit: December 07, 2010, 11:34:12 PM by jshiller »
Sebastian Land
« Reply #1 on: December 08, 2010, 11:14:58 AM »

Hi,
this is really advanced parsing. Normally I would not post a complete process but would simply outline the way to go, but it's a great example of what one can do with the Text Processing and Web extensions in combination. So here's this very cool process:

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
    <process expanded="true" height="628" width="736">
      <operator activated="true" class="web:get_webpage" compatibility="5.0.4" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
        <parameter key="url" value="http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&amp;artist=bono&amp;api_key=b25b959554ed76058ac220b7b2e0a026"/>
        <list key="query_parameters"/>
      </operator>
      <operator activated="true" class="text:cut_document" compatibility="5.0.7" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="artist" value="h:lfm/h:similarartists/h:artist"/>
        </list>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <process expanded="true" height="279" width="743">
          <operator activated="true" class="text:extract_information" compatibility="5.0.7" expanded="true" height="60" name="Extract Information" width="90" x="335" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="name" value="//h:name/text()"/>
              <parameter key="match" value="//h:match/text()"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="segment" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.0.7" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
        <parameter key="create_word_vector" value="false"/>
        <process expanded="true" height="261" width="743">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
      <connect from_op="Cut Document" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
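
For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of the same two-step approach: first cut the document into one segment per artist element, then extract name and match relative to each segment. It is only an illustration, not part of the process above. The sample XML is made up (placeholder names and values) and merely assumes the lfm / similarartists / artist / name, match nesting implied by the XPath queries; the function name similar_artist_rows is hypothetical. The sketch also assumes the response declares no XML namespace, so it uses plain element names rather than the h: prefix.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like the response described in the thread
# (lfm > similarartists > artist > name/match). Values are placeholders,
# not real API output.
SAMPLE_XML = """<lfm>
  <similarartists>
    <artist><name>Artist A</name><match>0.9</match></artist>
    <artist><name>Artist B</name><match>0.5</match></artist>
  </similarartists>
</lfm>"""


def similar_artist_rows(xml_text):
    """Return one (name, match) row per <artist> element."""
    root = ET.fromstring(xml_text)  # root is the <lfm> element
    rows = []
    # Step 1 (the "Cut Document" idea): treat each <artist> as its own segment.
    for artist in root.findall("./similarartists/artist"):
        # Step 2 (the "Extract Information" idea): read fields relative
        # to the current segment, so each row gets its own values.
        name = artist.findtext("name")
        match = artist.findtext("match")
        rows.append((name, float(match)))
    return rows


if __name__ == "__main__":
    for name, match in similar_artist_rows(SAMPLE_XML):
        print(name, match)
```

Querying name and match against the whole document at once, as in the original process, does not keep each name paired with its match; selecting the artist elements first and extracting per element is what yields one row per artist.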

Greetings,
  Sebastian
jshiller
« Reply #2 on: December 09, 2010, 10:21:48 AM »

Sebastian,

Thanks so much for providing the complete process! This helps a lot.

Best,

Jamie