Wednesday, April 23, 2014
IBM Tivoli Storage Manager is NOT affected by the OpenSSL Heartbleed vulnerability
I was asked to check into the Heartbleed bug and whether TSM was vulnerable. A quick Google search for "TSM Heartbleed" produced the following link, which states:
Friday, March 28, 2014
Poor Performance
Currently I work in an environment where we have a dedicated TSM instance for a large SAP DB (99TB currently). We just upgraded the drives in the tape library (yes, we use tape! I know... I know...) from Magstar 3592 TS1130 (E06) drives to TS1140 (E07) drives. The upgrade was pushed in hopes of a jump in write/backup performance, but I was skeptical. TSM adds so much overhead that you cannot rely on any manufacturer's raw tape read/write numbers. IBM is typically reasonable with its numbers, but in this case I have seen NO performance increase whatsoever. Here is a query of the storage pool backup processes.
UPDATE (04/04/2014): Let me give you some more specs. The 99TB DB is split between 4 TSM Storage Agents, each with four 8Gb HBAs. Each storage agent runs 4 sessions (allocating 4 drives) for its backup process, so the 4 storage agents account for 16 simultaneous sessions, and the 99TB backup still takes over 24 hours. The backups average around 70-78MB/sec per session. Is this a TSM overhead issue, or do I have a tuning issue with the TDP and TSM? I'm getting less than 50% of the throughput I should see.
Here's the command that is run to execute the DB backup:
ksh -c 'export DB2NODE=7 ; db2 "backup db DB8 LOAD /usr/tivoli/tsm/tdp_r3/db264/libtdpdb264.a OPEN 4 SESSIONS OPTIONS /db2/DB8/dbs/tsm_config/vendor.env.7 WITH 14 BUFFERS BUFFER 1024 PARALLELISM 8 WITHOUT PROMPTING" ; echo BACKUP_RC=$?'
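As a quick sanity check on the settings in that command, here is the buffer arithmetic worked out. DB2's BUFFER value is a count of 4 KB pages, so BUFFER 1024 means 4 MB buffers. This is a minimal Python sketch that only computes the memory involved for this one backup invocation (the DB2NODE=7 partition); the helper name is my own, not anything from DB2 or the TDP.

```python
# Memory footprint of the DB2 backup buffers used in the backup command.
# DB2's BUFFER option is expressed in 4 KB pages.
PAGE_BYTES = 4 * 1024


def buffer_memory_mb(num_buffers: int, buffer_pages: int) -> float:
    """Total MB allocated for backup buffers: buffers x pages x 4 KB."""
    return num_buffers * buffer_pages * PAGE_BYTES / (1024 * 1024)


per_buffer = buffer_memory_mb(1, 1024)  # a single BUFFER 1024 buffer
total = buffer_memory_mb(14, 1024)      # WITH 14 BUFFERS
print(f"each buffer: {per_buffer:.1f} MB, all 14 buffers: {total:.1f} MB")
```

Whether 14 buffers is enough to keep 4 sessions and a parallelism of 8 busy is a separate tuning question; the sketch only shows how much memory those options add up to.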
PROCESS_NUM: 2667
PROCESS: Backup Storage Pool
START_TIME: 03-27 23:21:54
DURATION: 00 23:20:13
BYTES: 6.0TB
AVG_THRPUT: 75.87 MB/s
PROCESS_NUM: 2668
PROCESS: Backup Storage Pool
START_TIME: 03-27 23:21:55
DURATION: 00 23:20:12
BYTES: 6.2TB
AVG_THRPUT: 78.48 MB/s
PROCESS_NUM: 2669
PROCESS: Backup Storage Pool
START_TIME: 03-27 23:21:55
DURATION: 00 23:20:12
BYTES: 6.2TB
AVG_THRPUT: 77.99 MB/s
PROCESS_NUM: 2670
PROCESS: Backup Storage Pool
START_TIME: 03-27 23:21:55
DURATION: 00 23:20:12
BYTES: 6.4TB
AVG_THRPUT: 80.13 MB/s
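To double-check that AVG_THRPUT lines up with BYTES and DURATION, the numbers can be recomputed by hand. A rough Python sketch, assuming DURATION is "days hh:mm:ss" and sizes use binary (1024-based) units; since BYTES is rounded to one decimal, the results only approximate the reported values.

```python
import re


def duration_seconds(duration: str) -> int:
    """Convert 'DD HH:MM:SS' (assumed format of DURATION) to seconds."""
    days, clock = duration.split()
    h, m, s = (int(x) for x in clock.split(":"))
    return int(days) * 86400 + h * 3600 + m * 60 + s


def size_mb(size: str) -> float:
    """Convert a size like '6.0TB' to MB using binary (1024-based) units."""
    match = re.fullmatch(r"([\d.]+)\s*(MB|GB|TB)", size.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {size!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value * {"MB": 1, "GB": 1024, "TB": 1024 ** 2}[unit]


processes = [  # (PROCESS_NUM, DURATION, BYTES) copied from the output above
    (2667, "00 23:20:13", "6.0TB"),
    (2668, "00 23:20:12", "6.2TB"),
    (2669, "00 23:20:12", "6.2TB"),
    (2670, "00 23:20:12", "6.4TB"),
]

rates = {num: size_mb(b) / duration_seconds(d) for num, d, b in processes}
for num, rate in rates.items():
    print(f"process {num}: ~{rate:.1f} MB/s")
print(f"combined: ~{sum(rates.values()):.0f} MB/s")
```

Each process works out to within about 1 MB/s of its reported AVG_THRPUT, so the reported numbers are consistent with the bytes moved over the elapsed time.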
I average anywhere from 75 to 80 MB/sec. Here is the Magstar performance chart. I am using JB media, not JC, so I do take a small performance hit for that.
So with JB media I could get as high as 200MB/sec, but I am not even reaching 50% of that number. Is there any specific tuning parameter I should look at that could be hindering performance?
FYI: the backup of the 99TB DB runs LAN-free using 16 tape drives and takes about 26 hours.
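Putting those specs together, the backup window can be estimated from per-session throughput. A back-of-the-envelope Python sketch, assuming binary units (1 TB = 1024 x 1024 MB), an even split of the 99TB across all 16 sessions, and ~77 MB/s as a representative per-session rate; it ignores compression, mount time and the tail of the slowest session.

```python
# Back-of-the-envelope backup window for the 99TB LAN-free backup.
# Assumes binary units (1 TB = 1024*1024 MB) and an even split across sessions.
DB_SIZE_TB = 99
SESSIONS = 16        # 4 storage agents x 4 sessions each
OBSERVED_MB_S = 77   # rough per-session average from the process output
RATED_MB_S = 200     # per-drive figure quoted for JB media


def hours_to_back_up(size_tb: float, sessions: int, mb_per_sec: float) -> float:
    """Wall-clock hours to move size_tb across `sessions` parallel streams."""
    size_mb = size_tb * 1024 * 1024
    aggregate = sessions * mb_per_sec
    return size_mb / aggregate / 3600


observed = hours_to_back_up(DB_SIZE_TB, SESSIONS, OBSERVED_MB_S)
rated = hours_to_back_up(DB_SIZE_TB, SESSIONS, RATED_MB_S)
print(f"aggregate observed: {SESSIONS * OBSERVED_MB_S} MB/s -> {observed:.1f} h")
print(f"aggregate rated:    {SESSIONS * RATED_MB_S} MB/s -> {rated:.1f} h")
```

At ~77 MB/s per session the estimate comes out around 23.4 hours, in the same ballpark as the 24 to 26 hours observed; at the rated 200 MB/s the same job would take roughly 9 hours.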
Friday, January 10, 2014
New TSM Admin In The House!
Just thought I should let everyone know that my wife and I had a son on December 3rd. The holidays and the lead-up to his birth have kept me busy. My son makes 8 kids total, so I'm a very busy man. Don't worry, I shall return, but the last 9 months have been a blur.
Sunday, December 8, 2013
Full TSMExplorer for TSM version 5 is free now
Got this info from Dmitry Dukhov - creator of TSMExplorer
The required registration procedure for getting a free license, and TSMExplorer for TSM version 5 itself, are available at http://www.s-iberia.com/download.html
Tuesday, October 22, 2013
Archive Report
Where I work, we have a process that bi-monthly generates a mksysb and then archives it to TSM. Recently, an attempt to use an archived mksysb revealed that the mksysb process sometimes does not create a valid file, but the file is still archived to TSM. So the other AIX admins asked me to generate a report showing how much data was archived and on what date. I would have told them it was impossible if they had asked for data from the backups table, but our archives table is not as large as the backups table, so I gave it a go.
The first problem was determining the best table(s) to use. I could use the summary table, but it doesn't tell me which schedule ran, and some of these UNIX servers have archive schedules other than the mksysb process. The idea I came up with was to query the contents table and join it with the archives table on the object_id field. Here's an example of the command:
select a.node_name, a.filespace_name, a.object_id, cast((b.file_size/1048576.0) as decimal(9,2)) as SIZE_MB, cast((a.archive_date) as date) as ARCHIVE from archives a, contents b where a.node_name=b.node_name and a.filespace_name='/mksysb_apitsm' and a.filespace_name=b.filespace_name and a.object_id=b.object_id and a.node_name like 'USA%'
This select takes at least 20 hours to run across 6 TSM servers. I guess I should be happy it returns at all, but TSM is DB2 underneath! It should be a lot faster, so I am wondering whether I could clean up the query or add something that would make it use the indexes and run faster. I am considering dropping the "like" and just matching node_name between the two tables. Would matching node_name first and then object_id be faster? Would I be better off running it directly in DB2? Suggestions appreciated.
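Once the select returns, the rows still have to be rolled up into the report the AIX admins asked for. A minimal Python sketch, assuming the output was captured comma-delimited (for example with dsmadmc's -commadelimited option) in the same column order as the select; the sample node names, sizes and the 100 MB "suspicious" threshold are hypothetical.

```python
import csv
import io
from collections import defaultdict


def archive_totals(csv_text: str) -> dict:
    """Sum SIZE_MB per (node, archive date) from comma-delimited select output.

    Assumes columns: NODE_NAME, FILESPACE_NAME, OBJECT_ID, SIZE_MB, ARCHIVE.
    """
    totals = defaultdict(float)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 5:
            continue  # skip blank or malformed lines
        node, _fs, _obj, size_mb, archived = (field.strip() for field in row)
        totals[(node, archived)] += float(size_mb)
    return dict(totals)


def suspicious(totals: dict, min_mb: float) -> list:
    """Return (node, date) pairs whose archived total is below min_mb."""
    return sorted(key for key, mb in totals.items() if mb < min_mb)


# Hypothetical sample rows; real node names, sizes and dates will differ.
sample = """USA0001,/mksysb_apitsm,1001,2150.50,2013-10-01
USA0001,/mksysb_apitsm,1002,0.01,2013-10-15
USA0002,/mksysb_apitsm,2001,1980.25,2013-10-01
"""
totals = archive_totals(sample)
print(suspicious(totals, min_mb=100.0))
```

With the sample data this flags USA0001 on 2013-10-15, the kind of tiny archive that would point to a mksysb that was never created properly.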
Monday, August 12, 2013
TSM Command Processing Tip
I constantly have to run large lists of commands, and sometimes I just don't want to deal with running them through a shell script. So what's the best way to run a list of commands without TSM prompting for a YES/NO? I can use a batch command with the -NOCONFIRM option from an admin command line, but sometimes that's more work than I want to deal with. There's got to be a better way. The simple answer is to define the TSM server to itself and route the command to it by prefixing the command with the server name. Here's an example: I have to delete empty volumes from storage pools rather than wait for the 1 day delay.
select 'ustsm07:del vol', cast((volume_name) as char(8)) as VOLNAME from volumes where pct_utilized=0 and devclass_name <> 'DISK'
RESULTS:
Unnamed[1] VOLNAME
---------------- ---------
ustsm07:del vol K00525
ustsm07:del vol K00526
ustsm07:del vol J00789
ustsm07:del vol J00197
ustsm07:del vol J00303
ustsm07:del vol J01172
ustsm07:del vol J01233
ustsm07:del vol J00850
ustsm07:del vol J00861
ustsm07:del vol K00018
ustsm07:del vol J01613
ustsm07:del vol J01624
ustsm07:del vol J01671
ustsm07:del vol J01687
ustsm07:del vol K00116
ustsm07:del vol K00130
ustsm07:del vol K00340
ustsm07:del vol K00348
tsm: USTSM07>USTSM07:del vol K00525
ANR1699I Resolved USTSM07 to 1 server(s) - issuing command DEL VOL K00525 against server(s).
ANR1687I Output for command 'DEL VOL K00525' issued against server USTSM07 follows:
ANR2208I Volume K00525 deleted from storage pool TAPE_A.
ANR1688I Output for command 'DEL VOL K00525' issued against server USTSM07 completed.
ANR1694I Server USTSM07 processed command 'DEL VOL K00525' and completed successfully.
ANR1697I Command 'DEL VOL K00525 processed by 1 server(s): 1 successful, 0 with warnings, and 0 with errors.
So I copy the output and paste it into my admin command line, and because I am using command routing (even to the same server I am on), TSM does not prompt for confirmation. So make sure you have defined your TSM servers to themselves so you can take advantage of this simple feature. Also note that TSM won't delete a volume that still contains data unless DISCARD=YES is specified, so I leave the DISCARD=YES option off and only EMPTY tapes are deleted.
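If you would rather not hand-trim the select output before pasting, the routed command lines can be pulled out of it mechanically. A small Python sketch, assuming the tabular output shown above was saved as text; the function name and the default prefix are just for illustration.

```python
def routed_commands(select_output: str, prefix: str = "ustsm07:del vol") -> list:
    """Pull the pasteable routed commands out of the select's tabular output."""
    wanted = " ".join(prefix.split()).lower()
    commands = []
    for line in select_output.splitlines():
        line = " ".join(line.split())  # collapse the column padding
        if line.lower().startswith(wanted):
            commands.append(line)
    return commands


output = """Unnamed[1]         VOLNAME
----------------   ---------
ustsm07:del vol    K00525
ustsm07:del vol    K00526
"""
for cmd in routed_commands(output):
    print(cmd)
```

The header and dashed separator lines are dropped, leaving one clean "ustsm07:del vol VOLUME" line per empty tape.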
Wednesday, July 31, 2013
IBM P7 Strange Behaviour
We have a P7 frame with 4 LPARs that are used as TSM storage agents; snapshots of our SAP DBs are mounted on them for backup. They had always performed well until one LPAR had a bad HBA that phoned home and was replaced. After the replacement, backup throughput dropped dramatically from 800MB/s to 150MB/s, and overall server performance would drop drastically during backups. When the DB requiring backup is over 25TB, that is a huge hit, and we could not find the root cause.

At first IBM said our Hitachi disk was the problem. We eliminated that right away, so we then replaced the new HBA, checked our fiber, and then checked the GBIC, and nothing fixed the situation. During the first week I asked the IBM service technician whether we could have a bad drawer or slot, and he emphatically said "No! If you did you would have errors all over the place." So we checked firmware, moved cards within the frame (again), and double-checked the fiber; by now we were into the third week. I kept asking whether something could be wrong with the drawer/slots and kept getting the same answer. I suggested it because of previous experience: I have seen hardware go bad without failing outright.

After everything other than replacing the slots had been exhausted, IBM finally replaced the slots. Voilà! Backup speeds went back to normal and the system degradation during backups disappeared. So the slots/drawer were the issue. No errors pointing to a slot/drawer hardware problem were ever logged, but something caused the slots to degrade performance. It took almost a month to resolve, and I wouldn't say IBM support was very thorough; at times they tried to push the problem off to other vendors (i.e. Hitachi). My only suggestion is to trust your instincts and push the CEs to follow every avenue. My headache is over, but now the RCA begins.
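To put that "huge hit" in numbers, here is a rough Python sketch of the backup window for a 25TB DB at the before and after rates, assuming a steady rate for the whole job and binary units (1 TB = 1024 x 1024 MB).

```python
def hours_for(size_tb: float, mb_per_sec: float) -> float:
    """Hours to move size_tb at a steady mb_per_sec (1 TB = 1024*1024 MB)."""
    return size_tb * 1024 * 1024 / mb_per_sec / 3600


before = hours_for(25, 800)  # throughput before the HBA replacement
after = hours_for(25, 150)   # throughput with the degraded slots
print(f"25TB at 800MB/s: {before:.1f} h; at 150MB/s: {after:.1f} h")
```

That is roughly 9 hours versus about 48.5 hours: a backup that used to fit overnight turns into a two-day job.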