Friday, August 29, 2014

TKLM - Things To Know Part 3

Identifying and Releasing Empty Volumes Back To Scratch

Because the TKLM server was unable to issue keys, TSM will have assigned tapes to a storage pool and then failed to write to them. To release those tapes back to scratch, after performing the resync check the TSM servers for volumes that are assigned to a storage pool but contain no data.  Use the following select statement to list the volumes that are 0 percent utilized. Notice that it builds a command within the results, allowing you to quickly release the tapes with a simple cut and paste into the TSM admin command line.
select varchar(a.server_name,10) ||':'|| 'del vol', varchar(b.volume_name,8) as volname, b.pct_utilized, varchar(b.stgpool_name,15) as stgpool_name from status a, volumes b where b.pct_utilized=0 and b.devclass_name<>'DISK' order by b.stgpool_name, b.pct_utilized

You should see the following if TSM shows tape(s) with 0% utilized:

Unnamed[1]              VOLNAME        PCT_UTILIZED     STGPOOL_NAME
-------------------     ---------     -------------     ----------------
TSM01:del vol           J02579                  0.0     COPYTAPE
TSM01:del vol           J00243                  0.0     DBTAPE
TSM01:del vol           K00700                  0.0     DBTAPE_B_NC
TSM01:del vol           J00039                  0.0     LOGTAPE
TSM01:del vol           H70341                  0.0     LOGTAPE
TSM01:del vol           J00186                  0.0     LOGTAPE
TSM01:del vol           J00115                  0.0     LOGTAPE
TSM01:del vol           J00528                  0.0     LOGTAPE
TSM01:del vol           J01224                  0.0     LOGTAPE
TSM01:del vol           J01255                  0.0     LOGTAPE

You can use a portion of the results to execute against the server to release the tapes. If you'd rather not see the PCT_UTILIZED or STGPOOL_NAME columns, remove them from the select:
select varchar(a.server_name,10) ||':'|| 'del vol', varchar(b.volume_name,8) as volname from status a, volumes b where b.pct_utilized=0 and b.devclass_name<>'DISK' order by b.stgpool_name, b.pct_utilized

Unnamed[1]              VOLNAME
-------------------     ---------
TSM01:del vol           J02579
TSM01:del vol           J00243
TSM01:del vol           K00700
TSM01:del vol           H70341
TSM01:del vol           J00039
TSM01:del vol           J00115
TSM01:del vol           J00186
TSM01:del vol           J00528
TSM01:del vol           J01173
TSM01:del vol           J01224
TSM01:del vol           J01255

Run this select against all the TSM servers that have libraries using the TKLM server, and run the results through the TSM admin command line to release the tapes back to scratch. Notice that we are deliberately NOT using the DISCARDDATA=YES option. Without it, TSM will refuse to delete a volume that holds some data, even when the amount is so small that the volume still reports as 0% utilized.

Note: When deleting volumes DO NOT USE DISCARDDATA=YES! Leaving it off keeps you from deleting a valid storage pool volume.
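
If you would rather not cut and paste by hand, the results can be turned into commands with a few lines of script. This is only a rough sketch in Python, assuming the select output has been saved as plain text in the column layout shown above; the function name build_delete_commands is made up for illustration. It deliberately never appends DISCARDDATA=YES.

```python
def build_delete_commands(select_output):
    """Turn rows like 'TSM01:del vol   J02579   0.0   COPYTAPE' into commands.

    Header, separator and blank lines are skipped. No DISCARDDATA option is
    ever added, so TSM can still refuse volumes that actually hold data.
    """
    commands = []
    for line in select_output.splitlines():
        fields = line.split()
        # A data row starts with '<server>:del', then 'vol', then the volume name.
        if len(fields) < 3 or not fields[0].endswith(":del") or fields[1] != "vol":
            continue
        commands.append(f"{fields[0]} vol {fields[2]}")
    return commands


if __name__ == "__main__":
    sample = """\
Unnamed[1]              VOLNAME        PCT_UTILIZED     STGPOOL_NAME
-------------------     ---------     -------------     ----------------
TSM01:del vol           J02579                  0.0     COPYTAPE
TSM01:del vol           J00243                  0.0     DBTAPE
"""
    for command in build_delete_commands(sample):
        print(command)
```

Review the generated list before pasting it into the admin command line; the script only reformats what the select returned.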

TKLM - Things To Know Part 2

Resolving TKLM Memory Issue

TKLM has a known issue with the Java memory heap size. This memory issue results in TKLM becoming slow to respond or ceasing to issue keys. You can check for an out-of-memory condition by reviewing the TKLM log /tklm/tip/profiles/TIPProfile/logs/server1/SystemOut.log and looking for the following error:

 java.lang.OutOfMemoryError
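
A quick way to look for that string is a small script. The sketch below is Python and only assumes that the log is a plain text file; the function name find_oom_lines and the sample lines are hypothetical and are not real TKLM log output.

```python
def find_oom_lines(log_lines):
    """Return (line_number, text) pairs that mention java.lang.OutOfMemoryError."""
    hits = []
    for number, text in enumerate(log_lines, start=1):
        if "java.lang.OutOfMemoryError" in text:
            hits.append((number, text.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Hypothetical lines for illustration only.
    sample = [
        "some ordinary log line\n",
        "another line mentioning java.lang.OutOfMemoryError\n",
    ]
    for number, text in find_oom_lines(sample):
        print(number, text)
```

To scan the real file, open it and pass the file object straight in, for example find_oom_lines(open(path, errors="replace")), where path is the SystemOut.log location given above.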

If this error is present, the short-term solution is to restart the primary and replica TKLM instances to clear the out-of-memory state. The long-term solution is to change the TKLM memory settings in the two files that determine the process's memory allotment.
·         Restart the TKLM primary and replica, which will flush the memory in use and allow TKLM to issue keys as before.

Note: This is a short-term solution and does not resolve the problem, as it will occur again after a period of time.
·         The permanent solution is to reduce the TKLM audit level to low and change the wsadmin process's Java memory heap size. This needs to be done in two locations and can be done by following the steps provided:

1.     Back up the /tklm filesystem before you edit the files.

sudo dsmc incremental /tklm

2.     Reduce the TKLM audit level to low by using the TKLM web GUI and navigating to
1)     TKLM > Configuration > Audit
2)     Select Low and click OK
Confirm by looking in this file: /tklm/tip/products/tklm/config/TKLMgrConfig.properties
Verify that the Audit.event.types and Audit.event.outcome variables state the following:

Audit.event.types = runtime, authorization, authorization_terminate, resource_management, key_management
Audit.event.outcome = failure

3.     Edit the wsadmin script and server.xml manually.
1)     You will find the two files that require editing, server.xml and wsadmin.sh, in the following directories:
/tklm/tip/profiles/TIPProfile/config/cells/TIPCell/nodes/TIPNode/servers/server1/server.xml
/tklm/tip/bin/wsadmin.sh

4.     Modify the -Xmx setting in wsadmin.sh.
Example:
1) Locate and modify the below entry
default value:
PERF_JVM_OPTION(S)="-Xms256m -Xmx256m -Xj9 -Xquickstart"

set max value:
PERF_JVM_OPTION(S)="-Xms256m -Xmx1280m -Xj9 -Xquickstart"
Note: The maximum heap size for wsadmin is 1280 MB

2) Save the changes

5.     Now modify the server.xml file by setting the genericJvmArguments variable to “-Xmx2048m”
1)     Locate and modify the below entry
genericJvmArguments="-Xmx2048m"
2)     Save the changes

6.     As root stop TKLM
1)    /tklm/tip/bin/stopServer.sh server1
7.     As root start TKLM
1)    /tklm/tip/bin/startServer.sh server1
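
After making the edits, the audit values from step 2 and the heap values from steps 4 and 5 can be double-checked with a script instead of by eye. The following is a minimal Python sketch, not an IBM tool. The helper names are made up for illustration, and the expected audit values are simply the ones listed in step 2.

```python
import re

# Values step 2 says to expect once the audit level is set to Low.
EXPECTED_TYPES = {
    "runtime",
    "authorization",
    "authorization_terminate",
    "resource_management",
    "key_management",
}
EXPECTED_OUTCOME = "failure"


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def audit_is_low(props):
    """True when both audit properties match the values listed in step 2."""
    raw_types = props.get("Audit.event.types", "")
    types = {item.strip() for item in raw_types.split(",") if item.strip()}
    return types == EXPECTED_TYPES and props.get("Audit.event.outcome") == EXPECTED_OUTCOME


def max_heap_mb(jvm_options):
    """Return the -Xmx value in megabytes, or None if the option is absent."""
    match = re.search(r"-Xmx(\d+)([mMgG])", jvm_options)
    if match is None:
        return None
    size = int(match.group(1))
    return size * 1024 if match.group(2).lower() == "g" else size


if __name__ == "__main__":
    print(max_heap_mb("-Xms256m -Xmx1280m -Xj9 -Xquickstart"))
    print(max_heap_mb("-Xmx2048m"))
```

Feed parse_properties the contents of TKLMgrConfig.properties, and feed max_heap_mb the option string you edited in wsadmin.sh or the genericJvmArguments value from server.xml.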

TKLM - Things To Know Part 1

DB2 Password and TKLM Data Source Out of Sync

On systems such as Linux or AIX, you might need to change the password for the DB2® Administrator user ID. The login password for the DB2 Administrator user ID and the DB2 password for the user ID must be the same.
The Tivoli Key Lifecycle Manager Installation program installs DB2 and prompts the installing person for a password for the user named tklmdb2. Additionally, the DB2 application creates an operating system user entry named tklmdb2. For example, the password for this user might expire, requiring you to resynchronize the password for both user IDs.
Typically you can identify if the DB2 ID password is no longer in sync with the data source password when you see this error when accessing TKLM through the GUI
 
Before you can change the password of the DB2 Administrator user ID, you must change the password for the system user entry. To resolve the password sync issue follow these steps:
Note: The original IBM document is located here.

1.     Log on to Tivoli Key Lifecycle Manager server as root.
2.     Change user to the tklmdb2 system user entry. Type:
su - tklmdb2
3.     Change the password. Type:
passwd
Specify the new password.
4.     Exit back to root.
exit
5.     In the TIP_HOME/bin directory, use the wsadmin interface that the WebSphere® Application Server provides to specify the Jython syntax.
./wsadmin.sh -username TIPAdmin -password mypwd -lang jython
6.     Change the password for the WebSphere Application Server data source:
a.     The following command lists the JAASAuthData entries:
wsadmin>print AdminConfig.list('JAASAuthData')
The result might look like this example:
(cells/TIPCell|security.xml#JAASAuthData_1396539704930)
(cells/TIPCell|security.xml#JAASAuthData_1396539705604)
b.    Type the AdminConfig.showall command for each entry, to locate the alias tklm_db. For example, type on one line:
print AdminConfig.showall ('(cells/TIPCell|security.xml#JAASAuthData_1396539704930)')
The result is like this example:
[alias tklmdb]
[description "TKLM database user J2C authentication alias"]
[password *****]
[userId ustklmdb]

And also type on one line:
print AdminConfig.showall ('(cells/TIPCell|security.xml#JAASAuthData_1396539705604)')
The result is like this example:
[alias tklm_db]
[description "TKLM database user j2c authentication alias"]
[password *****]
[userId ustklmdb]

c.     Change the password for the tklm_db alias that has the identifier JAASAuthData_1396539705604:
print AdminConfig.modify('JAASAuthData_list_entry', '[[password passw0rdc]]')
For example, type on one line:
print AdminConfig.modify('(cells/TIPCell|security.xml#JAASAuthData_1396539705604)', '[[password <password>]]')

d.    Change the password for the tklmdb alias that has the identifier JAASAuthData_1396539704930:
print AdminConfig.modify('JAASAuthData_list_entry', '[[password passw0rdc]]')
For example, type on one line:
print AdminConfig.modify('(cells/TIPCell|security.xml#JAASAuthData_1396539704930)', '[[password <password>]]')

e.     Save the changes:
print AdminConfig.save()
f.     Exit back to root.
exit
g.    In the TIP_HOME/bin directory, stop the Tivoli Integrated Portal application. For example, as TIPAdmin, type on one line:
stopServer.sh server1 -username tipadmin -password passw0rd
The result is like this example:

ADMU0116I: Tool information is being logged in file
//opt/IBM/tivoli/tiptklmV2/profiles/TIPProfile/logs/server1/stopServer.log
ADMU0128I: Starting tool with the TIPProfile profile
ADMU3100I: Reading configuration for server: server1
ADMU3201I: Server stop request issued. Waiting for stop status.
ADMU4000I: Server server1 stop completed.

h.     Start the Tivoli Integrated Portal application. As the Tivoli Integrated Portal administrator, type on one line:

 startServer.sh server1

i.      In the TIP_HOME/bin directory, use the wsadmin interface that the WebSphere Application Server provides to specify the Jython syntax.

./wsadmin.sh -username tipadmin -password mypwd -lang jython

j.      Verify that you can connect to the database using the WebSphere Application Server data source.

i.       First, query for a list of data sources. Type:

print AdminConfig.list('DataSource')

The result might be like this example:

"TKLM DataSource(cells/TIPCell/nodes/TIPNode/servers/server1|resources.xml#DataSource_1396539707355)"
"TKLM scheduler XA Datasource(cells/TIPCell/nodes/TIPNode/servers/server1|resources.xml#DataSource_1396539709814)"
"Tivoli Common Reporting Data Source(cells/TIPCell|resources.xml#DataSource_1396539473259)"
DefaultEJBTimerDataSource(cells/TIPCell/nodes/TIPNode/servers/server1|resources.xml#DataSource_1000001)
ttssdb(cells/TIPCell|resources.xml#DataSource_1396539429750)

ii.      Type:
print AdminControl.testConnection('TKLM DataSource(cells....)')
For example, type on one line:
print AdminControl.testConnection('TKLM DataSource(cells/TIPCell/nodes/TIPNode/servers/server1|resources.xml#DataSource_1396539707355)')
iii.     Test the connection on the remaining data source. For example, type:
print AdminControl.testConnection('TKLM scheduler XA Datasource(cells/TIPCell/nodes/TIPNode/servers/server1|resources.xml#DataSource_1396539709814)')
iv.    In both cases, you receive a message that the connection to the data source was successful. For example:

WASX7217I: Connection to provided data source was successful.
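
Matching each JAASAuthData entry to its alias in step 6b is easy to get wrong by eye, since the two aliases differ by a single underscore. Below is a minimal Python sketch that parses text in the "[key value]" layout shown in the showall examples above. It runs outside wsadmin on captured output, and the function names parse_showall and entries_by_alias are hypothetical.

```python
import re


def parse_showall(text):
    """Parse '[key value]' lines, as in the showall examples, into a dict."""
    attrs = {}
    for match in re.finditer(r"^\[(\S+)\s+(.*)\]\s*$", text, flags=re.MULTILINE):
        # Drop the surrounding double quotes that showall puts on descriptions.
        attrs[match.group(1)] = match.group(2).strip('"')
    return attrs


def entries_by_alias(showall_by_entry, wanted=("tklm_db", "tklmdb")):
    """Map each wanted alias to the JAASAuthData entry that carries it."""
    found = {}
    for entry, text in showall_by_entry.items():
        alias = parse_showall(text).get("alias")
        if alias in wanted:
            found[alias] = entry
    return found


if __name__ == "__main__":
    captured = {
        "(cells/TIPCell|security.xml#JAASAuthData_1396539704930)":
            "[alias tklmdb]\n[password *****]\n[userId ustklmdb]\n",
        "(cells/TIPCell|security.xml#JAASAuthData_1396539705604)":
            "[alias tklm_db]\n[password *****]\n[userId ustklmdb]\n",
    }
    for alias, entry in sorted(entries_by_alias(captured).items()):
        print(alias, entry)
```

The entry strings it returns are the ones to pass to AdminConfig.modify in steps 6c and 6d.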

Friday, August 15, 2014

TKLM and TSM Encryption

When it comes to encryption and TSM you find varying responses from admins. Some use the TSM server as the key manager, others implement a library-based key manager, and others use a third-party software product. In the past I used TSM's internal encryption key management option, and while it is a set-it-and-forget-it process, it has some limitations when it comes to Exports and DB Backups.  That is where third-party software like TKLM can be beneficial. I have recently implemented TKLM and after some hiccups along the way am still undecided on whether I like it.  If you use TKLM let me know your experience and if there are any issues of which I should be aware.  I'll post my hiccups next week as they will take some time to discuss.

Thursday, August 7, 2014

Poor Performance Followup

As a follow-up to the previous poor performance post I thought I'd post what the outcome was. As it turns out we checked performance tuning settings in TSM and AIX and no performance increase was seen. We asked the DB2 admins to review any of their settings and they could not find any tunables that had not already been implemented. We sent in servermon.pl output and although they saw performance that was sub-par, they couldn't designate what was causing it. There are no server/adapter/switch/disk/tape errors so nothing emerged as the culprit for our poor throughput performance.

So we reviewed the backup time of each TSM storage agent server used to back up this 101 TB SAP database. At the time the storage agents that perform the backup consisted of 5 LPARs, 4 of those in a single frame each with their own assigned I/O drawer. The 5th was in a separate 740 frame with its own I/O drawer. The 5th storage agent was completing the backup in a fraction of the time of the other 4 so we concluded we must be overloading the CEC on the 740. We moved one of the four storage agents out of the frame to a secondary frame and the results were awesome. See below:


You'll notice that the backup time didn't change with the update of the tape drives from E06 to E07. Hardware layout matters more than the performance of the tape drives. When a vendor tells you that just updating the hardware to newer iterations will increase performance, take it with a grain of salt. In our case we tested the new tape drives and saw no performance gains, but the go-ahead was given to upgrade to the newer hardware anyway, and as you'll see we didn't gain anything until we reworked the environment. Our task now is to identify how to increase TSM internal job performance (i.e. migration and storage pool backup), which has not seen significant performance gains from the tape upgrades.

Wednesday, April 30, 2014

Sony Develops 185TB Tape

Sony announced they have developed a tape medium and write process that can support 185TB per tape. Whoa, that's huge! Now if we can see it hit the market before some other storage strategy catchphrase becomes the "it" thing.  Check out the link below...."To the cloud!"

http://www.extremetech.com/computing/181560-sony-develops-tech-for-185tb-tapes-3700-times-more-storage-than-a-blu-ray-disc

Wednesday, April 23, 2014

Friday, March 28, 2014

Poor Performance

Currently I work in an environment where we have a specific TSM instance for a large SAP DB (99TB currently). We just upgraded the drives in the tape library (yes we use tape! I know...I know....) from MagStar 3592 TS1130 (E06) drives to TS1140 (E07) drives. The upgrade was pushed in hopes of a jump in write/backup performance, but I was skeptical. TSM adds so much overhead you cannot use the raw tape read/write numbers from any manufacturer. Typically IBM is somewhat reasonable with their numbers, but in this case I have seen NO performance increase whatsoever.  Here is a query of the processes for storage pool backup.

UPDATE (04/04/2014):  Let me give you some more specs. We have the 99TB DB split between 4 TSM Storage Agents, each having four 8Gb HBAs. Each storage agent runs 4 sessions (allocates 4 drives) for its backup process. So all 4 storage agents account for 16 simultaneous sessions and it still takes over 24 hours to perform the 99TB backup. The backups are averaging around 70-78MB/sec. Is this a TSM overhead issue or do I have a tuning issue with the TDP and TSM? I'm getting less than 50% of the throughput I should see.

Here's the command that is run to execute the DB backup:

ksh -c export DB2NODE=7 ; db2 "backup db DB8   LOAD /usr/tivoli/tsm/tdp_r3/db264/libtdpdb264.a OPEN 4 SESSIONS OPTIONS /db2/DB8/dbs/tsm_config/vendor.env.7 WITH 14 BUFFERS BUFFER 1024 PARALLELISM 8 WITHOUT PROMPTING" ; echo BACKUP_RC=$?

PROCESS_NUM: 2667
    PROCESS: Backup Storage Pool
 START_TIME: 03-27 23:21:54
   DURATION: 00 23:20:13
      BYTES: 6.0TB
 AVG_THRPUT: 75.87 MB/s

PROCESS_NUM: 2668
    PROCESS: Backup Storage Pool
 START_TIME: 03-27 23:21:55
   DURATION: 00 23:20:12
      BYTES: 6.2TB
 AVG_THRPUT: 78.48 MB/s

PROCESS_NUM: 2669
    PROCESS: Backup Storage Pool
 START_TIME: 03-27 23:21:55
   DURATION: 00 23:20:12
      BYTES: 6.2TB
 AVG_THRPUT: 77.99 MB/s

PROCESS_NUM: 2670
    PROCESS: Backup Storage Pool
 START_TIME: 03-27 23:21:55
   DURATION: 00 23:20:12
      BYTES: 6.4TB
 AVG_THRPUT: 80.13 MB/s

I average anywhere from 75 to 80 MB/sec.  Here is the Magstar performance chart. I am using JB media, not JC, so I do take a little hit in performance for that.

So with JB media I could get as high as 200MB/sec but I am not even at 50% of that number.  Is there any specific tuning parameter I should look at that could be hindering the performance?

FYI - The backup of the 99TB DB runs LAN-Free using 16 tape drives over 26 hrs.
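
For anyone who wants to sanity-check those figures, here is the back-of-the-envelope arithmetic as a small Python sketch. It assumes binary units (1 TB = 1024 x 1024 MB) and an even split of the data across all 16 drives; both are my assumptions, and the function name is made up. With 99TB, 26 hours and 16 drives it works out to roughly 69 MB/sec per drive, or about 35% of the 200MB/sec figure for JB media.

```python
def per_drive_throughput_mb_s(total_tb, hours, drives):
    """Average MB/sec per drive, assuming binary units and an even split."""
    total_mb = total_tb * 1024 * 1024
    seconds = hours * 3600
    return total_mb / seconds / drives


if __name__ == "__main__":
    rate = per_drive_throughput_mb_s(total_tb=99, hours=26, drives=16)
    ceiling = 200  # MB/sec, the JB media figure quoted above
    print(f"{rate:.1f} MB/sec per drive, {rate / ceiling:.0%} of {ceiling} MB/sec")
```

With decimal units the result comes out a few MB/sec lower, which does not change the conclusion.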

Friday, January 10, 2014

New TSM Admin In The House!

Just thought I should let everyone know that my wife and I had a son December 3rd. The holidays and lead up to his being born have kept me busy. My son makes 8 kids total and I'm a very busy man. So don't worry I shall return but the last 9 months have been a blur.