populate table with large csv file

  • Dave [Hawk-Systems]

    populate table with large csv file

    have the table "numberchec k"
    Attribute | Type | Modifier
    -----------+------------+----------
    svcnumber | integer | not null
    svcqual | varchar(9) |
    svcequip | char(1) |
    svctroub | varchar(6) |
    svcrate | varchar(4) |
    svcclass | char(1) |
    trailer | varchar(3) |
    Index: numbercheck_pke y

    also have a csv file
    7057211380,Y,,,3,B
    7057216800,Y,,,3,B
    7057265038,Y,,,3,B
    7057370261,Y,,,3,B
    7057374613,Y,,,3,B
    7057371832,Y,,,3,B
    4166336554,Y,,,3,B
    4166336863,Y,,,3,B
    7057201148,Y,,,3,B

    aside from parsing the csv file through a PHP interface, what is the easiest
    way to get that csv data imported into the postgres database. thoughts?

    thanks

    Dave




  • Joe Conway

    #2
    Re: populate table with large csv file

    Dave [Hawk-Systems] wrote:
    > aside from parsing the csv file through a PHP interface, what is the easiest
    > way to get that csv data imported into the postgres database. thoughts?

    see the COPY command.


    Joe




    • P.J. "Josh" Rovero

      #3
      Re: populate table with large csv file

      Dave [Hawk-Systems] wrote:
      > aside from parsing the csv file through a PHP interface, what is the easiest
      > way to get that csv data imported into the postgres database. thoughts?

      Assuming the CSV file data is well formed, use psql and
      the COPY command.

      In psql, create the table. Then issue command:

      copy <tablename> from 'filename' using delimiters ',';
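      For the table upthread, an invocation might look like the following (a
      sketch only; the path is hypothetical, and note that COPY reads the file
      on the server, while psql's \copy variant reads it on the client):

      copy numbercheck from '/tmp/numbers.csv' using delimiters ',';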
      --
      P. J. "Josh" Rovero                          Sonalysts, Inc.
      Email: rovero@sonalysts.com  www.sonalysts.com  215 Parkway North
      Work: (860)326-3671 or 442-4355              Waterford CT 06385




      • Ron Johnson

        #4
        Re: populate table with large csv file

        On Thu, 2003-09-25 at 11:38, Dave [Hawk-Systems] wrote:
        > have the table "numbercheck"
        >  Attribute | Type       | Modifier
        > -----------+------------+----------
        >  svcnumber | integer    | not null
        >  svcqual   | varchar(9) |
        >  svcequip  | char(1)    |
        >  svctroub  | varchar(6) |
        >  svcrate   | varchar(4) |
        >  svcclass  | char(1)    |
        >  trailer   | varchar(3) |
        > Index: numbercheck_pkey
        >
        > also have a csv file
        > 7057211380,Y,,,3,B
        > 7057216800,Y,,,3,B
        > 7057265038,Y,,,3,B
        > 7057370261,Y,,,3,B
        > 7057374613,Y,,,3,B
        > 7057371832,Y,,,3,B
        > 4166336554,Y,,,3,B
        > 4166336863,Y,,,3,B
        > 7057201148,Y,,,3,B
        >
        > aside from parsing the csv file through a PHP interface, what is the easiest
        > way to get that csv data imported into the postgres database. thoughts?

        No matter what you do, it's going to barf: svcnumber is a 32-bit
        integer, and 7,057,211,380 is significantly out of range.

        Once you change svcnumber to bigint, the COPY command will easily
        suck in the csv file.
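        A minimal sketch of that fix on a recent PostgreSQL release (older
        versions lack ALTER COLUMN ... TYPE and would need the table recreated;
        the file path is hypothetical):

        -- int4 tops out at 2,147,483,647; 10-digit phone numbers overflow it
        ALTER TABLE numbercheck ALTER COLUMN svcnumber TYPE bigint;
        COPY numbercheck FROM '/tmp/numbers.csv' WITH (FORMAT csv);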

        --
        -----------------------------------------------------------------
        Ron Johnson, Jr.                 ron.l.johnson@cox.net
        Jefferson, LA USA

        "Python is executable pseudocode; Perl is executable line noise"






        • Dave [Hawk-Systems]

          #5
          Re: populate table with large csv file

          >> aside from parsing the csv file through a PHP interface, what is the
          >> easiest way to get that csv data imported into the postgres database.
          >> thoughts?
          >
          > Assuming the CSV file data is well formed, use psql and
          > the COPY command.
          >
          > In psql, create the table. Then issue command:
          >
          > copy <tablename> from 'filename' using delimiters ',';

          Perfect solution that was overlooked.

          Unfortunately, processing the 143 MB file, which would result in a
          database size of approx 500 MB, takes an eternity. As luck would have
          it, we can get away with just dropping to an exec and doing a cat/grep
          for any data we need... takes 2-3 seconds.

          The copy command is definitely a keeper, as I am now looking at
          replacing code elsewhere with a simpler model using it.

          Thanks

          Dave





          • Ron Johnson

            #6
            Re: populate table with large csv file

            On Fri, 2003-09-26 at 06:58, Dave [Hawk-Systems] wrote:
            [snip]
            > Unfortunately, processing the 143 MB file, which would result in a
            > database size of approx 500 MB, takes an eternity. As luck would
            > have it, we can get away with just

            Something's not right, then. I loaded 30GB in about 8 hours, on
            a slow system, with non-optimized IO. Did you drop the indexes
            first?
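            A sketch of that bulk-load pattern (the names are taken from the
            table upthread; for a primary key the constraint is dropped rather
            than the index itself, and the file path is hypothetical):

            ALTER TABLE numbercheck DROP CONSTRAINT numbercheck_pkey;
            COPY numbercheck FROM '/tmp/numbers.csv' WITH (FORMAT csv);
            -- rebuild the index in a single pass after the load
            ALTER TABLE numbercheck ADD PRIMARY KEY (svcnumber);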

            --
            -----------------------------------------------------------------
            Ron Johnson, Jr.                 ron.l.johnson@cox.net
            Jefferson, LA USA

            "As the night fall does not come at once, neither does
            oppression. It is in such twilight that we must all be aware of
            change in the air - however slight - lest we become unwitting
            victims of the darkness."
            Justice William O. Douglas




            • David Griffiths

              #7
              Tuning/performance question.


              We are doing some performance testing among various databases (Oracle, MySQL
              and Postgres).

              One of the queries is showing Postgres lagging quite a bit:

              SELECT count(*)
              FROM commercial_entity, country, user_account, address_list
              LEFT JOIN state_province ON address_list.state_province_id =
              state_province.state_province_id
              LEFT JOIN contact_info ON address_list.contact_info_id =
              contact_info.contact_info_id
              WHERE address_list.address_type_id = 101
              AND commercial_entity.commercial_entity_id =
              address_list.commercial_entity_id
              AND address_list.country_id = country.country_id
              AND commercial_entity.user_account_id = user_account.user_account_id
              AND user_account.user_role_id IN (101, 101);

              I ran a "vacuum analyze" after realizing that I had loaded all the data into
              the database without redoing the statistics; the query jumped from 19
              seconds to 41 seconds _after_ the analyze.

              I'd also like to make sure my query is written correctly - I want
              the count of records where commercial_entity matches user_account,
              address_list, and country, with left outer joins from address_list
              to state_province and to contact_info.

              Finally, I read some posts on the shared_buffers; they stated that the
              shared_buffers should be set to 1/4 to 1/5 of total memory available. Is
              that correct? I give the MySQL/InnoDB buffers about 70% of the 2 gig on the
              machine.
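              For reference, that setting lives in postgresql.conf; a sketch for
              a 2 GB machine, assuming the 1/4-of-RAM figure above (units and
              sensible values vary by server version, so the number is purely
              illustrative):

              # postgresql.conf
              shared_buffers = 512MB    # roughly 1/4 of 2 GB RAM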


              Here's the explain (I'm not too familiar with reading a Postgres
              explain...):

              --------------------------------------------------------------------------------
              Aggregate  (cost=52951.09..52951.09 rows=1 width=116)
                ->  Merge Join  (cost=52941.61..52950.83 rows=105 width=116)
                      Merge Cond: ("outer".country_id = "inner".country_id)
                      ->  Index Scan using country_pkey on country  (cost=0.00..7.54 rows=231 width=11)
                      ->  Sort  (cost=52941.61..52941.88 rows=105 width=105)
                            Sort Key: address_list.country_id
                            ->  Merge Join  (cost=52729.54..52938.07 rows=105 width=105)
                                  Merge Cond: ("outer".commercial_entity_id = "inner".commercial_entity_id)
                                  ->  Sort  (cost=8792.01..8792.52 rows=201 width=36)
                                        Sort Key: commercial_entity.commercial_entity_id
                                        ->  Nested Loop  (cost=0.00..8784.31 rows=201 width=36)
                                              ->  Index Scan using usr_acc_usr_role_id_i on user_account  (cost=0.00..2403.08 rows=1401 width=12)
                                                    Index Cond: (user_role_id = 101::numeric)
                                              ->  Index Scan using comm_ent_usr_acc_id_i on commercial_entity  (cost=0.00..4.54 rows=1 width=24)
                                                    Index Cond: (commercial_entity.user_account_id = "outer".user_account_id)
                                  ->  Sort  (cost=43937.53..44173.84 rows=94526 width=69)
                                        Sort Key: address_list.commercial_entity_id
                                        ->  Merge Join  (cost=29019.03..32585.73 rows=94526 width=69)
                                              Merge Cond: ("outer".contact_info_id = "inner".contact_info_id)
                                              ->  Index Scan using contact_info_pkey on contact_info  (cost=0.00..3366.76 rows=56435 width=12)
                                              ->  Sort  (cost=29019.03..29255.34 rows=94526 width=57)
                                                    Sort Key: address_list.contact_info_id
                                                    ->  Merge Join  (cost=16930.18..18354.55 rows=94526 width=57)
                                                          Merge Cond: ("outer".state_province_id = "inner".state_province_id)
                                                          ->  Index Scan using state_province_pkey on state_province  (cost=0.00..3.81 rows=67 width=11)
                                                          ->  Sort  (cost=16930.18..17166.50 rows=94526 width=46)
                                                                Sort Key: address_list.state_province_id
                                                                ->  Seq Scan on address_list  (cost=0.00..6882.52 rows=94526 width=46)
                                                                      Filter: (address_type_id = 101::numeric)

              What's the "Sort (cost...)"?

              I noticed that joining the address_list to country was slow; there was no
              index on just country_id; there were composite indexes on multiple columns,
              so I added one and did a vacuum analyze on the table, and got:

              Aggregate  (cost=54115.74..54115.74 rows=1 width=116)
                ->  Merge Join  (cost=54105.91..54115.46 rows=109 width=116)
                      Merge Cond: ("outer".country_id = "inner".country_id)
                      ->  Index Scan using country_pkey on country  (cost=0.00..7.54 rows=231 width=11)
                      ->  Sort  (cost=54105.91..54106.19 rows=110 width=105)
                            Sort Key: address_list.country_id
                            ->  Merge Join  (cost=53884.34..54102.18 rows=110 width=105)
                                  Merge Cond: ("outer".commercial_entity_id = "inner".commercial_entity_id)
                                  ->  Sort  (cost=8792.01..8792.52 rows=201 width=36)
                                        Sort Key: commercial_entity.commercial_entity_id
                                        ->  Nested Loop  (cost=0.00..8784.31 rows=201 width=36)
                                              ->  Index Scan using usr_acc_usr_role_id_i on user_account  (cost=0.00..2403.08 rows=1401 width=12)
                                                    Index Cond: (user_role_id = 101::numeric)
                                              ->  Index Scan using comm_ent_usr_acc_id_i on commercial_entity  (cost=0.00..4.54 rows=1 width=24)
                                                    Index Cond: (commercial_entity.user_account_id = "outer".user_account_id)
                                  ->  Sort  (cost=45092.32..45335.37 rows=97221 width=69)
                                        Sort Key: address_list.commercial_entity_id
                                        ->  Merge Join  (cost=29770.81..33338.09 rows=97221 width=69)
                                              Merge Cond: ("outer".contact_info_id = "inner".contact_info_id)
                                              ->  Index Scan using contact_info_pkey on contact_info  (cost=0.00..3366.76 rows=56435 width=12)
                                              ->  Sort  (cost=29770.81..30013.86 rows=97221 width=57)
                                                    Sort Key: address_list.contact_info_id
                                                    ->  Merge Join  (cost=17271.79..18731.55 rows=97221 width=57)
                                                          Merge Cond: ("outer".state_province_id = "inner".state_province_id)
                                                          ->  Index Scan using state_province_pkey on state_province  (cost=0.00..3.81 rows=67 width=11)
                                                          ->  Sort  (cost=17271.79..17514.84 rows=97221 width=46)
                                                                Sort Key: address_list.state_province_id
                                                                ->  Seq Scan on address_list  (cost=0.00..6882.52 rows=97221 width=46)
                                                                      Filter: (address_type_id = 101::numeric)

              No difference. Note that all the keys that are used in the joins are
              numeric(10)'s, so there shouldn't be any cast issues.
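              For reference, the index addition described above amounts to
              something like this (the index name is hypothetical):

              CREATE INDEX addr_list_country_id_i ON address_list (country_id);
              VACUUM ANALYZE address_list;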

              When you create a primary key on a table, is an index created? (I
              seem to remember a message going by stating that an index would be
              added.)

              For comparison, our production Oracle database (running on nearly identical
              hardware - the Postgres machine has IDE-RAID-5 and the Oracle machine has
              RAID mirroring) takes between 1 and 2 seconds.

              I've got one last question, and I really hope responses don't get
              sidetracked by it; I see a lot of negative comments towards MySQL,
              many of them stating that it's a database layer on top of the file
              system. Can someone explain why Postgres is better than MySQL 4.0.14
              using InnoDB? MySQL, on the above query, with one less index (on
              address_list.country) takes 0.20 seconds.

              David.




              • Holger Marzen

                #8
                Re: Tuning/performance question.

                On Sat, 27 Sep 2003, David Griffiths wrote:
                >
                > We are doing some performance testing among various databases (Oracle, MySQL
                > and Postgres).
                >
                > One of the queries is showing Postgres lagging quite a bit:
                >
                > SELECT count(*)
                > FROM commercial_entity, country, user_account, address_list
                > LEFT JOIN state_province ON address_list.state_province_id =
                > state_province.state_province_id
                > LEFT JOIN contact_info ON address_list.contact_info_id =
                > contact_info.contact_info_id
                > WHERE address_list.address_type_id = 101
                > AND commercial_entity.commercial_entity_id =
                > address_list.commercial_entity_id
                > AND address_list.country_id = country.country_id
                > AND commercial_entity.user_account_id = user_account.user_account_id
                > AND user_account.user_role_id IN (101, 101);

                I guess that this question has been discussed very often - but I
                cannot remember why exactly. Is there a pointer to a technical
                explanation? Has it something to do with MVCC? But isn't it one of
                MVCC's benefits that we can make a consistent online backup
                without archiving redo logs (Oracle can't, DB2 can)? Is DB2 slower
                than Oracle in such cases (count(*)) as well?

                Workaround:
                We can sometimes fake a bit to avoid such costly queries and set up a
                trigger that calls a function that increases a counter in a separate
                counter table. Then we are lightning-fast.
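                A minimal sketch of that counter-table trick on a current
                PostgreSQL release (every name here is hypothetical, it only
                tracks inserts, and the single counter row becomes an update hot
                spot under heavy concurrency):

                CREATE TABLE row_counter (tablename text PRIMARY KEY,
                                          n bigint NOT NULL);
                INSERT INTO row_counter VALUES ('address_list', 0);

                CREATE FUNCTION bump_counter() RETURNS trigger AS $$
                BEGIN
                    UPDATE row_counter SET n = n + 1
                        WHERE tablename = TG_TABLE_NAME;
                    RETURN NEW;
                END;
                $$ LANGUAGE plpgsql;

                CREATE TRIGGER address_list_count AFTER INSERT ON address_list
                    FOR EACH ROW EXECUTE PROCEDURE bump_counter();

                -- the costly count(*) then becomes a single-row lookup:
                SELECT n FROM row_counter WHERE tablename = 'address_list';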

                But many users complain about PostgreSQL's poor count(*)
                performance; that's true, and it can be critical when someone
                wants to replace another database product with PostgreSQL.





                • Tom Lane

                  #9
                  Re: Tuning/performance question.

                  Holger Marzen <holger@marzen.de> writes:
                  > But many users complain about PostgreSQL's poor count(*) performance;

                  I don't think that's relevant here. Some other DB's have shortcuts for
                  determining the total number of rows in a single table, that is they can
                  do "SELECT COUNT(*) FROM a_table" quickly, but David's query is messy
                  enough that I can't believe anyone can actually do it without forming
                  the join result.

                  What I'd ask for is EXPLAIN ANALYZE output. Usually, if a complex query
                  is slower than it should be, it's because the planner is picking a bad
                  plan. So you need to look at how its estimates diverge from reality.
                  But plain EXPLAIN doesn't show the reality, only the estimates ...
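                  For what it's worth, that just means prefixing the query, which
                  runs it and reports actual row counts and timings alongside the
                  planner's estimates:

                  EXPLAIN ANALYZE
                  SELECT count(*)
                  FROM commercial_entity, country, user_account, address_list
                  LEFT JOIN state_province
                      ON address_list.state_province_id = state_province.state_province_id
                  LEFT JOIN contact_info
                      ON address_list.contact_info_id = contact_info.contact_info_id
                  WHERE address_list.address_type_id = 101
                    AND commercial_entity.commercial_entity_id = address_list.commercial_entity_id
                    AND address_list.country_id = country.country_id
                    AND commercial_entity.user_account_id = user_account.user_account_id
                    AND user_account.user_role_id IN (101, 101);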

                  David, could we see EXPLAIN ANALYZE for the query, and also the table
                  schemas (psql \d displays would do)? Also, please take it to
                  pgsql-performance, it's not really on-topic for pgsql-general.

                  regards, tom lane



                  • David Griffiths

                    #10
                    Re: Tuning/performance question.

                    > I guess that this question has been discussed very often - but I
                    > cannot remember why exactly. Is there a pointer to a technical
                    > explanation? Has it something to do with MVCC? But isn't it one of
                    > MVCC's benefits that we can make a consistent online backup
                    > without archiving redo logs (Oracle can't, DB2 can)? Is DB2 slower
                    > than Oracle in such cases (count(*)) as well?
                    >
                    > Workaround:
                    > We can sometimes fake a bit to avoid such costly queries and set up a
                    > trigger that calls a function that increases a counter in a separate
                    > counter table. Then we are lightning-fast.
                    >
                    > But many users complain about PostgreSQL's poor count(*)
                    > performance; that's true, and it can be critical when someone
                    > wants to replace another database product with PostgreSQL.

                    This is but one of many tests we're doing. The count(*)
                    performance is not the deciding factor. This query was pulled
                    from our production system, and I've extracted the exact tables
                    and data from the production system to test.

                    MySQL with MyISAM does in fact cheat on the count(*). InnoDB
                    does not, however. The "explain" indicates that it's doing the
                    work, and analyzing the tables dropped the cost of the query
                    from .35 seconds to .20 seconds.

                    Here's the same query, but selecting data (to test the
                    database's ability to find a single row quickly):

                    SELECT current_timestamp;
                    SELECT company_name, address_1, address_2, address_3, city,
                    address_list.state_province_id, state_province_short_desc, country_desc,
                    zip_code, address_list.country_id,
                    contact_info.email, commercial_entity.user_account_id, phone_num_1,
                    phone_num_fax, website, boats_website
                    FROM commercial_entity, country, user_account,
                    address_list LEFT JOIN state_province ON address_list.state_province_id =
                    state_province.state_province_id
                    LEFT JOIN contact_info ON address_list.contact_info_id =
                    contact_info.contact_info_id
                    WHERE address_list.address_type_id = 101
                    AND commercial_entity.commercial_entity_id = 225528
                    AND commercial_entity.commercial_entity_id =
                    address_list.commercial_entity_id
                    AND address_list.country_id = country.country_id
                    AND commercial_entity.user_account_id = user_account.user_account_id
                    AND user_account.user_role_id IN (101, 101);
                    SELECT current_timestamp;

                    Postgres takes about 33 seconds to get the row back.

                    Here's the "EXPLAIN":

                    Nested Loop  (cost=0.00..64570.33 rows=1 width=385)
                      ->  Nested Loop  (cost=0.00..64567.30 rows=1 width=361)
                            ->  Nested Loop  (cost=0.00..64563.97 rows=1 width=349)
                                  Join Filter: ("outer".commercial_entity_id = "inner".commercial_entity_id)
                                  ->  Index Scan using commercial_entity_pkey on commercial_entity  (cost=0.00..5.05 rows=1 width=94)
                                        Index Cond: (commercial_entity_id = 225528::numeric)
                                  ->  Materialize  (cost=63343.66..63343.66 rows=97221 width=255)
                                        ->  Merge Join  (cost=0.00..63343.66 rows=97221 width=255)
                                              Merge Cond: ("outer".contact_info_id = "inner".contact_info_id)
                                              ->  Nested Loop  (cost=0.00..830457.52 rows=97221 width=222)
                                                    Join Filter: ("outer".state_province_id = "inner".state_province_id)
                                                    ->  Index Scan using addr_list_ci_id_i on address_list  (cost=0.00..586676.65 rows=97221 width=205)
                                                          Filter: (address_type_id = 101::numeric)
                                                    ->  Seq Scan on state_province  (cost=0.00..1.67 rows=67 width=17)
                                              ->  Index Scan using contact_info_pkey on contact_info  (cost=0.00..3366.76 rows=56435 width=33)
                            ->  Index Scan using user_account_pkey on user_account  (cost=0.00..3.32 rows=1 width=12)
                                  Index Cond: ("outer".user_account_id = user_account.user_account_id)
                                  Filter: (user_role_id = 101::numeric)
                      ->  Index Scan using country_pkey on country  (cost=0.00..3.01 rows=1 width=24)
                            Index Cond: ("outer".country_id = country.country_id)
                    (20 rows)

                    David.





                    • David Griffiths

                      #11
                      Re: Tuning/performance question.

                      > David, could we see EXPLAIN ANALYZE for the query, and also the table
                      > schemas (psql \d displays would do)? Also, please take it to
                      > pgsql-performance, it's not really on-topic for pgsql-general.
                      >
                      > regards, tom lane

                      Will do.

                      Thanks,
                      David.




                      • Dennis Gearon

                        #12
                        Re: Tuning/performance question.

                        Holger Marzen wrote:
                        > On Sat, 27 Sep 2003, David Griffiths wrote:
                        >
                        >> We are doing some performance testing among various databases (Oracle, MySQL
                        >> and Postgres).
                        >>
                        >> One of the queries is showing Postgres lagging quite a bit:
                        >>
                        >> SELECT count(*)
                        >> FROM commercial_entity, country, user_account, address_list
                        >> LEFT JOIN state_province ON address_list.state_province_id =
                        >> state_province.state_province_id
                        >> LEFT JOIN contact_info ON address_list.contact_info_id =
                        >> contact_info.contact_info_id
                        >> WHERE address_list.address_type_id = 101
                        >> AND commercial_entity.commercial_entity_id =
                        >> address_list.commercial_entity_id
                        >> AND address_list.country_id = country.country_id
                        >> AND commercial_entity.user_account_id = user_account.user_account_id
                        >> AND user_account.user_role_id IN (101, 101);
                        >
                        > I guess that this question has been discussed very often - but I
                        > cannot remember why exactly. Is there a pointer to a technical
                        > explanation? Has it something to do with MVCC? But isn't it one of
                        > MVCC's benefits that we can make a consistent online backup
                        > without archiving redo logs (Oracle can't, DB2 can)? Is DB2 slower
                        > than Oracle in such cases (count(*)) as well?
                        >
                        > Workaround:
                        > We can sometimes fake a bit to avoid such costly queries and set up a
                        > trigger that calls a function that increases a counter in a separate
                        > counter table. Then we are lightning-fast.
                        >
                        > But many users complain about PostgreSQL's poor count(*)
                        > performance; that's true, and it can be critical when someone
                        > wants to replace another database product with PostgreSQL.
                        Yup, it'd be nice to have faster count(*) performance.





