How to improve the performance of COPY commands?

李玉珏@163

How to improve the performance of COPY commands?

When the COPY command is used to import a large amount of data, the execution time is quite long.
In our current test environment, throughput is roughly 10,000 rows per second, so importing 100 million rows takes several hours.
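The import command we run looks something like this (the table, column, and file names here are just examples):

COPY FROM '/data/big_table.csv' INTO big_table (id, name, val) FORMAT CSV;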

Is there a faster way to import, or can COPY be run in parallel?

Thanks!

Ivan Pavlukhin

Re: How to improve the performance of COPY commands?

Hi,

Currently COPY is the mechanism designed for the fastest data load. Yes,
you can try splitting your data into chunks and executing COPY in
parallel. By the way, where is your input located, and what is its size
in bytes (GB)? Is persistence enabled? Does the DataRegion have enough
memory to keep all the data?
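For example (file, table, and column names below are hypothetical), you could split the CSV beforehand and run one COPY per chunk, each from its own JDBC connection, so the statements execute concurrently:

-- connection 1:
COPY FROM '/data/big_table_part_00.csv' INTO big_table (id, name, val) FORMAT CSV;

-- connection 2, running at the same time:
COPY FROM '/data/big_table_part_01.csv' INTO big_table (id, name, val) FORMAT CSV;

Since COPY works over the thin JDBC driver, each concurrent statement needs its own connection.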



--
Best regards,
Ivan Pavlukhin
李玉珏@163

Re: How to improve the performance of COPY commands?

Hi,

The CSV file is about 250 GB, with about 1 billion rows of data.
Persistence is on and there is enough memory.
The import completed successfully, but it took a long time.

The current problem is that after this large table is imported
successfully, we import another table of 50 million rows, and the data
write speed slows down significantly.

We have four hosts in total. The cache configuration is as follows:
<property name="backups" value="1"/>
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>

Persistence is enabled; the other parameters are nothing special.
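(For what it's worth, for tables created through SQL the backups setting can also be given at creation time; the table definition below is just an example:)

CREATE TABLE big_table (id BIGINT PRIMARY KEY, name VARCHAR, val DOUBLE)
WITH "backups=1";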


ilya.kasnacheev

Re: How to improve the performance of COPY commands?

Hello!

The recommendation here is to disable the WAL before ingesting data into a table. You can do that by issuing ALTER TABLE tbl NOLOGGING;

After the data is loaded, you should turn it back on with ALTER TABLE tbl LOGGING;
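Put together, a load would look roughly like this (the table, column, and file names are examples):

ALTER TABLE big_table NOLOGGING;  -- stop writing WAL records for this table's cache
COPY FROM '/data/big_table.csv' INTO big_table (id, name, val) FORMAT CSV;
ALTER TABLE big_table LOGGING;    -- re-enable the WAL once the load completes

Note that while logging is disabled the loaded data is not protected by the WAL, so a node failure during the load can lose it.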

Regards,
--
Ilya Kasnacheev



李玉珏@163

Re: How to improve the performance of COPY commands?

Thank you for your reply. I'll try it.

>