Hi RM forum,
I am working on a fairly large text data set (a CSV file) with about 300,000 records and about 40,000 columns. I read the data with the "CSV Reader" operator and then process it with classification and clustering operators. The problem is that, as far as I know, the "CSV Reader" operator first loads the entire CSV file into main memory. The CSV file is several gigabytes in size and occupies more than half of my main memory, so many algorithms fail due to insufficient memory.
As a workaround, I converted my data set to the fantastic "RM sparse format". I don't know exactly how the "Sparse Reader" delivers the data, but I hoped this would alleviate the memory problem. Nevertheless, the problem still exists.
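(For intuition, the reason a sparse format helps at all is that a sparse row only stores its nonzero entries, keyed by column index, instead of all 40,000 values. This is just a minimal Python sketch of the general idea, not RapidMiner's actual implementation:)

```python
# Sketch: dense vs. sparse storage of one example row.
# A dense row with 40,000 columns stores every value;
# a sparse row stores only the nonzero entries.

def to_sparse(dense_row):
    """Keep only nonzero values, keyed by column index."""
    return {i: v for i, v in enumerate(dense_row) if v != 0}

def sparse_get(sparse_row, i):
    """Missing keys are implicit zeros."""
    return sparse_row.get(i, 0)

dense = [0.0] * 40_000
dense[7] = 1.5
dense[123] = 2.0

sparse = to_sparse(dense)
print(len(sparse))            # 2 stored entries instead of 40,000
print(sparse_get(sparse, 7))  # 1.5
print(sparse_get(sparse, 0))  # 0
```

(So the savings only materialize when most values really are zero; with mostly nonzero columns, the sparse form can even be larger than the dense one.)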
Now, in order to handle the problem of insufficient main memory, I have a new idea: using a database (such as PostgreSQL). I think a database could fetch an appropriate number of examples into main memory at a time and deliver them to other operators (e.g. classification operators). Please let me know if I am right about this. I would also like to know the advantages and disadvantages of using a database versus plain files.
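(To make the batching idea concrete, here is a minimal sketch using Python's built-in sqlite3 module standing in for PostgreSQL, since both expose the same cursor-based access pattern. Instead of loading the whole table, the cursor fetches a fixed-size batch of rows at a time, so only that batch needs to be in memory; the table and column names are made up for the example:)

```python
import sqlite3

# In-memory database standing in for PostgreSQL (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE examples (id INTEGER, value REAL)")
conn.executemany("INSERT INTO examples VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(10)])

def iter_batches(connection, batch_size):
    """Yield rows in batches of `batch_size` via cursor.fetchmany(),
    so at most `batch_size` rows are held in memory at once."""
    cur = connection.execute("SELECT id, value FROM examples ORDER BY id")
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield batch

# 10 rows in batches of 4 arrive as chunks of 4, 4 and 2.
for batch in iter_batches(conn, 4):
    print(len(batch))
```

(Whether this actually helps inside RapidMiner depends on whether the downstream operator itself can work incrementally: a learner that needs the full example set in memory at once would still force everything to be loaded.)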
Any help would be appreciated.