Large amount of files to parse/organize, tips on algorithm?

  • cnb

    Large amount of files to parse/organize, tips on algorithm?

    I have a bunch of files consisting of movie reviews.

    For each file I construct a list of reviews, and then for each new file
    I merge the reviews so that in the end I have a list of reviewers and,
    for each reviewer, all their reviews.

    What is the fastest way to do this?

    1. Create one file with reviews, open the next file and, for each review,
    see if the reviewer exists; if so add the review, otherwise create a new
    reviewer.

    2. Create all the separate files with reviews, then mergesort them?
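
For illustration, option 1 could be done in memory with a dictionary keyed by reviewer, so that the "does this reviewer exist" check is a hash lookup rather than a scan. This is only a minimal sketch: the function name `merge_reviews` and the `(customer_id, rating, date)` tuple layout are assumptions made for the example, not the actual file format.

```python
from collections import defaultdict

def merge_reviews(files_of_reviews):
    """Group reviews by reviewer across many per-file review lists.

    Each review is assumed (hypothetically) to be a
    (customer_id, rating, date) tuple.
    """
    by_reviewer = defaultdict(list)
    for reviews in files_of_reviews:
        for customer_id, rating, date in reviews:
            # defaultdict creates the empty list the first time a reviewer is seen
            by_reviewer[customer_id].append((rating, date))
    return dict(by_reviewer)

# Two made-up "files" of reviews
file_a = [(101, 4, "2005-01-03"), (102, 2, "2005-02-10")]
file_b = [(101, 5, "2005-03-15")]
merged = merge_reviews([file_a, file_b])
```

With a few hundred thousand small records this should fit comfortably in memory, though that depends on the machine.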

  • Steven D'Aprano

    #2
    Re: Large amount of files to parse/organize, tips on algorithm?

    On Tue, 02 Sep 2008 09:48:32 -0700, cnb wrote:
    > I have a bunch of files consisting of moviereviews.
    >
    > For each file I construct a list of reviews and then for each new file I
    > merge the reviews so that in the end have a list of reviewers and for
    > each reviewer all their reviews.
    >
    > What is the fastest way to do this?

    Use the timeit module to find out.

    > 1. Create one file with reviews, open next file an for each review see
    > if the reviewer exists, then add the review else create new reviewer.
    >
    > 2. create all the separate files with reviews then mergesort them?

    The answer will depend on whether you have three reviews or three
    million, whether each review is twenty words or twenty thousand words,
    and whether you have to do the merging once only or over and over again.
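
As a rough sketch of what timing the two approaches might look like, the snippet below uses `timeit.timeit` on synthetic data. The two functions (`merge_with_dict`, `merge_with_sort`) and the random `(reviewer, rating)` pairs are assumptions invented for the example; real timings would need the real files and record sizes.

```python
import random
import timeit
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Synthetic data: 10,000 (reviewer, rating) pairs from 1,000 reviewers
random.seed(0)
reviews = [(random.randrange(1000), random.randint(1, 5)) for _ in range(10_000)]

def merge_with_dict(records):
    """Approach 1: look each reviewer up in a dict and append."""
    grouped = defaultdict(list)
    for reviewer, rating in records:
        grouped[reviewer].append(rating)
    return grouped

def merge_with_sort(records):
    """Approach 2: sort by reviewer, then group adjacent records."""
    ordered = sorted(records, key=itemgetter(0))
    return {reviewer: [rating for _, rating in group]
            for reviewer, group in groupby(ordered, key=itemgetter(0))}

for func in (merge_with_dict, merge_with_sort):
    seconds = timeit.timeit(lambda: func(reviews), number=20)
    print(f"{func.__name__}: {seconds:.3f} s for 20 runs")
```

The numbers will vary by machine and data, which is the point of measuring rather than guessing.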


    --
    Steven


    • cnb

      #3
      Re: Large amount of files to parse/organize, tips on algorithm?

      On Sep 2, 7:06 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
      > On Tue, 02 Sep 2008 09:48:32 -0700, cnb wrote:
      > > I have a bunch of files consisting of moviereviews.
      > >
      > > For each file I construct a list of reviews and then for each new file I
      > > merge the reviews so that in the end have a list of reviewers and for
      > > each reviewer all their reviews.
      > >
      > > What is the fastest way to do this?
      >
      > Use the timeit module to find out.
      >
      > > 1. Create one file with reviews, open next file an for each review see
      > > if the reviewer exists, then add the review else create new reviewer.
      > >
      > > 2. create all the separate files with reviews then mergesort them?
      >
      > The answer will depend on whether you have three reviews or three
      > million, whether each review is twenty words or twenty thousand words,
      > and whether you have to do the merging once only or over and over again.
      >
      > --
      > Steven


      I merge only once. Each review has 3 fields: date, rating and customer ID.
      In total I'll be parsing between 10K and 100K reviews, eventually 450K.


      • cnb

        #4
        Re: Large amount of files to parse/organize, tips on algorithm?

        There are over 17,000 files...

        It's the Netflix Prize data.


        • Eric Wertman

          #5
          Re: Large amount of files to parse/organize, tips on algorithm?

          I think you really want to use a relational database of some sort for this.
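
One possible way to try the database route without installing anything is the `sqlite3` module from the standard library. The sketch below is illustrative only: the table name, column names and sample rows are made up, and an in-memory database stands in for a real file.

```python
import sqlite3

# In-memory database for the example; pass a filename to keep the data on disk
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews ("
    "movie_id INTEGER, customer_id INTEGER, rating INTEGER, date TEXT)"
)

# Made-up sample rows: (movie_id, customer_id, rating, date)
rows = [
    (1, 101, 4, "2005-01-03"),
    (1, 102, 2, "2005-02-10"),
    (2, 101, 5, "2005-03-15"),
]
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)", rows)

# Building the index after the bulk insert is usually cheaper than before it
conn.execute("CREATE INDEX idx_reviews_customer ON reviews (customer_id)")
conn.commit()

# All reviews by one reviewer, oldest first
per_reviewer = conn.execute(
    "SELECT movie_id, rating, date FROM reviews "
    "WHERE customer_id = ? ORDER BY date",
    (101,),
).fetchall()
```

The "merge" then becomes a query, and the database takes care of grouping and ordering.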

          On Tue, Sep 2, 2008 at 2:02 PM, cnb <circularfunc@yahoo.se> wrote:


          • Paul Rubin

            #6
            Re: Large amount of files to parse/organize, tips on algorithm?

            cnb <circularfunc@yahoo.se> writes:
            > For each file I construct a list of reviews and then for each new file
            > I merge the reviews so that in the end have a list of reviewers and
            > for each reviewer all their reviews.
            >
            > What is the fastest way to do this?

            Scan through all the files sequentially, emitting records like

            (movie, reviewer, review)

            Then use an external sort utility to sort/merge that output file
            on each of the 3 columns. Beats writing code.
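
The emitting step might look something like the sketch below, which flattens per-movie review lists into tab-separated lines. The function name `emit_records`, the tuple layout and the sample data are assumptions for the example; an `io.StringIO` buffer stands in for the real output file.

```python
import csv
import io

def emit_records(movies, out):
    """Write one tab-separated line per review to the file-like object `out`.

    `movies` is assumed (hypothetically) to yield (movie_id, reviews) pairs,
    where each review is a (customer_id, rating, date) tuple.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for movie_id, reviews in movies:
        for customer_id, rating, date in reviews:
            writer.writerow([movie_id, customer_id, rating, date])

# Made-up sample input
movies = [
    (1, [(102, 2, "2005-02-10"), (101, 4, "2005-01-03")]),
    (2, [(101, 5, "2005-03-15")]),
]
buffer = io.StringIO()
emit_records(movies, buffer)
flat = buffer.getvalue()
```

Written to a real file instead (say a hypothetical `records.tsv`), the output could then be ordered by reviewer with a Unix `sort` invocation along the lines of `sort -t$'\t' -k2,2n records.tsv` in bash, leaving the heavy lifting to the sort utility.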


            • jay graves

              #7
              Re: Large amount of files to parse/organize, tips on algorithm?

              On Sep 2, 1:02 pm, cnb <circularf...@yahoo.se> wrote:
              > over 17000 files...
              >
              > netflixprize.


              specifically:

