Testing Baan IV on Intel + Linux [a bit long] [Archive]

ariolim

24th February 2005, 16:23

Hi all.

I would like to hear from you about my testing of Baan IV on an Intel platform with Linux and Oracle 10g, host mode.

Our live environment is on a Sun Enterprise 10000 with 16 SPARC CPUs at 400 MHz; Solaris 8 + Oracle 8.1.7 DB, host mode.

The test is done on a dual Xeon 3 GHz 64-bit (Hyper-Threading on) with 4 GB memory. To evaluate the improvement (or not) I compared the executions of session tipcs5201m000 "Generate Planned PRP Orders", which is used often in our company.

Live execution in multi-bshell (1+7) takes from 3 to 5 hours.
Here comes the good news: test execution, same data, one single bshell takes about 1 hour!

Now the bad news: I ran tests with multi-bshell and the results don't change :mad:
With 1+2, 1+4, 1+8 bshells it always takes about an hour. The only thing that changes is the idle CPU: with 1+4 it is almost 0%, with 1+8 the idle is always 0%.

I have set the hidden parameters in Oracle 10g as suggested by solution 166049 but nothing changes. Kernel parameters are set as required by Oracle.

Any idea about that?

Thanx,
Marco

Dikkie Dik

24th February 2005, 17:10

First of all I think that you did some nice tests. But as with all tests, there are too many possible pitfalls.

As you are using a dual CPU machine I would also expect an improvement when using multiple bshells. But maybe your system is not ideal:
- do you have sufficient memory available for other processes or does Oracle have everything?
- is I/O maybe a bottleneck?
- is the code path maybe not optimal and do you have locking issues?

To validate, you can use all kinds of tools, but I suggest starting with the Call Graph Profiler as described in chapter 2 of the document I uploaded here (http://www.baanboard.com/baanboard/showthread.php?t=7665)

Without finding the bottleneck I advise you not to move to this new machine as the live machine. Maybe it can handle a relatively small number of users very well, but it may have only 1 disk or other bottlenecks that prevent you from using it for lots of users.

Happy digging,
Dick

naabi0

24th February 2005, 17:12

I believe you need more than two processors to see any improvement.

dave_23

24th February 2005, 17:12

What are the bshells doing? Is it starting 4 but then only using 1 (i.e. are the others idle)?

Are you running your PRP in Regenerative Mode?

The natural conclusion that you might draw is that you have 1 or 2 big
projects that are getting generated, and the rest of them only take a very
small amount of time. So all of the work might be being done by a single
bshell...

Dave

Dikkie Dik

24th February 2005, 17:16

I believe you need more than two processors to see any improvement.

Nope. With 2 CPUs and multiple bshells your throughput can already gain a lot. In fact they are HT CPUs, so you will see 4 CPUs. But I expect that a single bshell performs better with HT turned off.

Kind regards,
Dick

Dikkie Dik

24th February 2005, 17:21

The natural conclusion that you might draw is that you have 1 or 2 big
projects that are getting generated, and the rest of them only take a very
small amount of time. So all of the work might be being done by a single
bshell...

That is also a very interesting point. I expect that you can see this very easily by measuring the CPU time every 10 minutes. If CPU usage drops after 10 minutes, this may indeed be caused by one very large project.
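The periodic CPU measurement suggested here could be sketched roughly as follows. This is only an illustration, assuming a Linux host where /proc/<pid>/stat is available; the function names are made up for the example and the pids of the bshell processes have to be supplied by hand.

```python
import os

def cpu_seconds(pid):
    """Total user + system CPU seconds used so far by a process (Linux /proc only)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name is in parentheses and may contain spaces, so split
    # after the last ')'. The remaining fields start at field 3 (state),
    # which puts utime (field 14) at index 11 and stime (field 15) at index 12.
    fields = stat.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])
    return (utime + stime) / os.sysconf("SC_CLK_TCK")

def interval_deltas(samples):
    """samples: list of {pid: cpu_seconds} snapshots taken at a fixed interval.
    Returns one {pid: cpu_seconds_used_in_that_interval} dict per interval."""
    deltas = []
    for before, after in zip(samples, samples[1:]):
        deltas.append({pid: after[pid] - before.get(pid, 0.0) for pid in after})
    return deltas

def busiest_share(delta):
    """Fraction of an interval's CPU time that was used by the busiest process."""
    total = sum(delta.values())
    return max(delta.values()) / total if total > 0 else 0.0
```

Taking a snapshot of all bshell pids every 10 minutes and feeding the list to interval_deltas shows whether the load stays spread out. A busiest_share close to 1.0 in the later intervals would point to one bshell doing nearly all the work.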

Kind regards,
Dick

ariolim

24th February 2005, 17:49

PRP is Regenerative. All processes are working hard: Oracle server, Baan Oracle driver and bshell6.1. I think that the projects are distributed equally.

ariolim

24th February 2005, 18:22

As you are using a dual CPU machine I would also expect an improvement when using multiple bshell. But maybe your system is not ideal:
- do you have sufficient memory available for other processes or has Oracle everything?
- is IO maybe a bottleneck?
- is maybe the code path not optimal and do you have locking issues?
Happy digging,
Dick

1- SGA=1,2 Gb
2- When CPUs are less than 100% I see I/O wait, when CPUs are 100% I/O wait is 0.
3- Don't understand: what is code path? Locking issues? :confused:

Dikkie Dik

25th February 2005, 09:41

1- SGA=1,2 Gb
2- When CPUs are less than 100% I see I/O wait, when CPUs are 100% I/O wait is 0.
3- Don't understand: what is code path? Locking issues? :confused:

1- As you have a 4 GB system, it looks like memory is not a problem.
2- When is the CPU at 100%? Only in multi-bshell mode?
3- I mean: the code may not be optimal, so that you see locking or retries. If you face a lot of locking, it means the code, database, or data isn't scalable. If you sum up the counts for the important functions from the call graph output of a multiple bshell run and compare them to the single bshell run, this will probably show you whether you faced locking / retries or not.
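The comparison in point 3 can be sketched like this. It assumes the call graph output has already been reduced by hand to one {function name: call count} dictionary per bshell; the real profiler output format is not shown in this thread, and the function names used in the usage note are purely hypothetical.

```python
from collections import Counter

def total_counts(runs):
    """Sum the per-function call counts over all bshells of one multi-bshell run."""
    total = Counter()
    for run in runs:
        total.update(run)  # Counter.update adds counts instead of replacing them
    return total

def count_ratios(single, multi_runs, min_count=1):
    """For each function seen in the single-bshell run, return
    (summed multi-bshell count) / (single-bshell count)."""
    combined = total_counts(multi_runs)
    ratios = {}
    for func, base in single.items():
        if base >= min_count:
            ratios[func] = combined.get(func, 0) / base
    return ratios
```

With the same data, functions doing the real work should come out near 1.0. If something like a retry or lock routine shows a ratio far above 1.0, the extra bshells are probably spending their time waiting on each other.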

happy digging,
Dick

ariolim

25th February 2005, 13:42

I have made other tests, this time using "iostat" to monitor and I got interesting results:

- Disk I/O read almost 0
- I/O write 5 MB/s on average

With one single bshell, %util is less than 20%
With 1+4 bshells it is greater than 90%

As described in the iostat man page:
" %util
Percentage of CPU time during which I/O requests were issued to the
device (bandwidth utilization for the device). Device saturation
occurs when this value is close to 100%.
"

Could I say that disks are the possible cause of this bad performance?
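To keep an eye on %util during a whole run instead of reading it by hand, the extended iostat output can be parsed with a few lines of script. This is a rough sketch that assumes the text comes from "iostat -x" and that the device table has a header line starting with "Device" and containing a "%util" column; column layouts differ between sysstat versions, so the column is located by name. The 90% threshold is my own assumption for "close to 100%".

```python
def parse_util(iostat_text):
    """Return {device: %util} from the last device table in `iostat -x` output."""
    util = {}
    col = None
    for line in iostat_text.splitlines():
        parts = line.split()
        if not parts:
            col = None  # a blank line ends the current device table
            continue
        if parts[0].startswith("Device") and "%util" in parts:
            col = parts.index("%util")
            util = {}  # a new report starts; keep only the latest one
            continue
        if col is not None and len(parts) > col:
            # Some locales print a decimal comma instead of a point.
            util[parts[0]] = float(parts[col].replace(",", "."))
    return util

def saturated(util, threshold=90.0):
    """Devices whose %util is at or above the (assumed) saturation threshold."""
    return sorted(dev for dev, pct in util.items() if pct >= threshold)
```

Running this on the output captured with one bshell and again with 1+4 bshells would show directly which device goes from under 20% to over 90%.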

Thanks,
Marco

Dikkie Dik

25th February 2005, 13:53

It is quite possible that I/O is the problem, as %util increases significantly when running with multiple bshells. Can you describe your disk layout (number, RPM, spread of load, RAID)? How many disks and how fast? When running for over an hour, a cache of a few MB doesn't help you anymore.

Hope this helps,
Dick

ariolim

25th February 2005, 15:44

The data disk is a 420 GB RAID-5 LUN made up of 4 Seagate 140 GB, 10,000 rpm, SCSI-3 disks. The SCSI adapter is a Dell PERC 4/DC.

All Oracle files are in the same directory: datafiles, redo logs, control files.
A colleague tells me that it's better to move the redo logs away from that disk.
And possibly to increase their size: now they are 10 MB each.

Dikkie Dik

25th February 2005, 16:30

In general I hate RAID 5 for production. See http://www.baarf.com/ for more info. RAID 5 is not the best-performing option, and this can indeed be the root cause. I assume that if you remove RAID 5 and only stripe (only valid for test environments) you will see a dramatic I/O improvement.
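A back-of-the-envelope sketch of why RAID 5 hurts a write-heavy load, applied to the 4-disk LUN described above. It uses the usual rule of thumb that a small random write on RAID 5 costs 4 disk I/Os (read data, read parity, write data, write parity) against 1 for plain striping. The 125 IOPS per disk used in the usage note is an assumed ballpark for a 10,000 rpm drive, not a measured value.

```python
def raid5_usable_gb(disks, disk_gb):
    """Usable capacity of a RAID-5 set: one disk's worth of space goes to parity."""
    return (disks - 1) * disk_gb

def random_write_iops(disks, iops_per_disk, write_penalty):
    """Rough ceiling on small random writes per second for the whole array.
    write_penalty: disk I/Os per logical write (RAID 0 = 1, RAID 5 = 4)."""
    return disks * iops_per_disk / write_penalty
```

raid5_usable_gb(4, 140) gives the 420 GB of the LUN. With the assumed 125 IOPS per disk, random_write_iops(4, 125, 4) comes to about 125 writes per second on RAID 5 against about 500 for the same four disks striped, which is the kind of gap to expect in a test without parity.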

Kind regards,
Dick

ariolim

25th February 2005, 16:41

Additionally I read about a bug in Oracle 10.1.0.3 for Linux 64-bit (metalink docID 539559.994) that causes too many trace files to be dumped in the 'udump' area: currently about 1 every 4-5 seconds (!)
Moreover, in my tests I see that the redo logs switch every 10-15 seconds (!!)

In this scenario, I/O clearly plays a big part in the bad performance :eek:
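The redo figures can be turned into a rough sizing estimate. This is simple arithmetic on the numbers above (10 MB logs, a switch every 10-15 seconds). The target of one switch every 20 minutes in the usage note is only an assumed rule of thumb, so substitute whatever interval fits your site.

```python
def redo_rate_mb_per_s(log_size_mb, switch_interval_s):
    """Average redo generation rate implied by how fast a log of given size fills."""
    return log_size_mb / switch_interval_s

def log_size_for_interval(rate_mb_per_s, target_interval_s):
    """Redo log size in MB needed for one switch per target interval at that rate."""
    return rate_mb_per_s * target_interval_s
```

A 10 MB log filling in 10 to 15 seconds means roughly 0.7 to 1.0 MB/s of redo. At that rate, a log that switches only once every 20 minutes (1200 seconds) would have to be in the order of 800 to 1200 MB during the PRP run.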

Marco