Tuesday, December 21, 2010


A java.lang.OutOfMemoryError is a subclass of java.lang.VirtualMachineError that is thrown when the Java Virtual Machine is broken or has run out of resources that are necessary to continue the operation of the Java Virtual Machine. Obviously, memory is the exhausted resource for a java.lang.OutOfMemoryError, which is thrown when the Java Virtual Machine cannot allocate an object due to memory constraints.

Occasionally we hit an OutOfMemoryError (OOME) in the QA or development environment. It is an Error rather than an Exception, and the application usually cannot continue normally once it is thrown. We can usually tell from the error message whether the OOME comes from the Java heap or from native memory. Many blogs and articles discuss this area; the ones I found through Google are simple and easy to understand.
This is the list of OutOfMemoryErrors a typical Java application can see:
  • java.lang.OutOfMemoryError: PermGen space 
    • The Sun JVM has a permanent generation, but JRockit does not allocate a separate perm space
    • -XX:MaxPermSize -XX:PermSize
  • java.lang.OutOfMemoryError: Java heap space 
    • jmap, jhat,  YourKit, Jprofiler (heap dump)
    • -Xms128m -Xmx1024m
  • java.lang.OutOfMemoryError: unable to create new native thread 
    • The java process size has reached its limit (OS tries to reserve space for thread stack within process address space)
    • -Xss512k
  • java.lang.OutOfMemoryError: GC overhead limit exceeded 
    • JVM took too long to free up memory during its GC process. This error can be thrown from the Parallel or Concurrent collectors.
    • -XX:-UseGCOverheadLimit
    •  The parallel collector will throw an OutOfMemoryError if too much time is being spent in garbage collection: if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown. This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small. If necessary, this feature can be disabled by adding the option -XX:-UseGCOverheadLimit to the command line.
  • java.lang.OutOfMemoryError: requested xxxx bytes for Chunk::new. Out of swap space
    • The process could not allocate memory as the virtual memory was full
    • The solution is to reduce the amount of heap memory you have allocated.
  • java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    • The application requested an array that is larger than the virtual machine's predefined limit
    • We need to check the source code to make sure that no huge array is created, dynamically or statically.
Side note: Another setting you may want to check on is the system ulimit setting. The ulimit setting controls the limit on how many file handles a single process can have open. Since the unix philosophy is to treat nearly everything as a file, this is trickier than it seems. Running out of file handles can be reported as an OutOfMemory error, which may take a while to figure out. Most flavors of unix have a reasonably high default ulimit these days, but just in case, check.
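
To make the list above easier to use during triage, here is a minimal sketch that maps an OutOfMemoryError message to the first thing to check. The class name OomeTriage and the hint texts are my own assumptions for illustration; the message fragments and JVM options are the ones listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps an OutOfMemoryError message to a first triage hint.
final class OomeTriage {
    private static final Map<String, String> HINTS = new LinkedHashMap<>();
    static {
        HINTS.put("PermGen space", "check -XX:PermSize / -XX:MaxPermSize");
        HINTS.put("Java heap space", "take a heap dump; check -Xms / -Xmx");
        HINTS.put("unable to create new native thread", "check thread count, -Xss and OS limits");
        HINTS.put("GC overhead limit exceeded", "heap is likely too small; see -XX:-UseGCOverheadLimit");
        HINTS.put("Out of swap space", "native memory exhausted; consider a smaller heap");
        HINTS.put("Requested array size exceeds VM limit", "look for oversized array allocations");
    }

    // Returns the hint for the first known message fragment found in the error message.
    static String hint(Throwable t) {
        String msg = String.valueOf(t.getMessage());
        for (Map.Entry<String, String> e : HINTS.entrySet()) {
            if (msg.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return "unknown OutOfMemoryError variant";
    }
}
```

For example, OomeTriage.hint(new OutOfMemoryError("Java heap space")) returns the heap hint, while an unrecognized message falls through to the "unknown" text.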

Thursday, December 16, 2010

Filesystem read-only caused ORA-01034: ORACLE not available

Usually ORA-01034: ORACLE not available means that either the db instance is not up, or ORACLE_HOME or ORACLE_SID is not set correctly in the environment from which you are trying to connect to the db instance.

For details, see http://www.freelists.org/post/oracle-l/fixing-a-bad-oracle-install,1 
Oracle uses a proprietary algorithm that combines the ORACLE_HOME and ORACLE_SID to come up with a shared memory key, which is used at shared memory segment creation time, i.e., when the SGA is allocated. After that, further bequeath connections must have the same ORACLE_HOME and ORACLE_SID defined, so that they can define the same key value, and use it to attach to that existing SGA. If the ORACLE_HOME and/or ORACLE_SID is set incorrectly, the key value will be calculated incorrectly, and the server process will not be able to attach to the SGA shared memory segments.

Recently we had the same error in the QA environment:
ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
Linux-x86_64 Error: 2: No such file or directory

Root Cause:
Neither of the two reasons above was the root cause. The kernel log shows disk I/O errors on sda9 that aborted the ext3 journal, followed by "Remounting filesystem read-only".

[root@mvnowdb03 ~]# cat /var/log/messages
Dec 15 10:03:14 mvnowdb03 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 15 10:03:14 mvnowdb03 kernel: ata1.00: BMDMA stat 0x25
Dec 15 10:03:14 mvnowdb03 kernel: ata1.00: cmd 35/00:20:f8:18:3d/00:00:2d:00:00/e0 tag 0 dma 16384 out
Dec 15 10:03:14 mvnowdb03 kernel:          res 51/10:20:f8:18:3d/10:00:2d:00:00/e0 Emask 0x81 (invalid argument)
Dec 15 10:03:14 mvnowdb03 kernel: ata1.00: status: { DRDY ERR }
Dec 15 10:03:14 mvnowdb03 kernel: ata1.00: error: { IDNF }
Dec 15 10:03:15 mvnowdb03 kernel: ata1.00: configured for UDMA/133
Dec 15 10:03:15 mvnowdb03 kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Dec 15 10:03:15 mvnowdb03 kernel: sda: Current [descriptor]: sense key: Aborted Command
Dec 15 10:03:15 mvnowdb03 kernel:     Add. Sense: Recorded entity not found
Dec 15 10:03:15 mvnowdb03 kernel:
Dec 15 10:03:15 mvnowdb03 kernel: Descriptor sense data with sense descriptors (in hex):
Dec 15 10:03:15 mvnowdb03 kernel:         72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Dec 15 10:03:16 mvnowdb03 kernel:         2d 3d 18 f8
Dec 15 10:03:16 mvnowdb03 kernel: end_request: I/O error, dev sda, sector 758978808
Dec 15 10:03:16 mvnowdb03 kernel: Buffer I/O error on device sda9, logical block 35429835
Dec 15 10:03:16 mvnowdb03 kernel: lost page write due to I/O error on sda9
Dec 15 10:03:16 mvnowdb03 kernel: Buffer I/O error on device sda9, logical block 35429836
Dec 15 10:03:16 mvnowdb03 kernel: lost page write due to I/O error on sda9
Dec 15 10:03:16 mvnowdb03 kernel: Buffer I/O error on device sda9, logical block 35429837
Dec 15 10:03:16 mvnowdb03 kernel: lost page write due to I/O error on sda9
Dec 15 10:03:16 mvnowdb03 kernel: Buffer I/O error on device sda9, logical block 35429838
Dec 15 10:03:16 mvnowdb03 kernel: lost page write due to I/O error on sda9
Dec 15 10:03:16 mvnowdb03 kernel: ata1: EH complete
Dec 15 10:03:16 mvnowdb03 kernel: SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
Dec 15 10:03:16 mvnowdb03 kernel: sda: Write Protect is off
Dec 15 10:03:16 mvnowdb03 kernel: SCSI device sda: drive cache: write back
Dec 15 10:03:16 mvnowdb03 kernel: Aborting journal on device sda9.
Dec 15 10:03:16 mvnowdb03 kernel: ext3_abort called.
Dec 15 10:03:16 mvnowdb03 kernel: EXT3-fs error (device sda9): ext3_journal_start_sb: Detected aborted journal
Dec 15 10:03:16 mvnowdb03 kernel: Remounting filesystem read-only
Dec 15 10:03:16 mvnowdb03 kernel: __journal_remove_journal_head: freeing b_committed_data

Solution:
  1. One suggestion is to unmount the disk, run fsck on it, and then remount it. After that, we need to restart the DB instances and app servers to rebuild the connection pools
  2. Replace the dying (bad) disk

Wednesday, December 15, 2010

File format caused md5sum check failure

For every release we need MD5 checksums to verify the integrity of the released files. md5sum is a useful tool for generating and verifying MD5 checksums.

Here are some examples:
C:\>md5sum --help
C:\>md5sum *.jar > test.md5
C:\>md5sum -c test.md5
release-part1.jar: OK
release-part2.jar: OK

QA reported the following failure when verifying the checksums:
md5sum -c test.md5
: No such file or directory
: FAILED open or read
: No such file or directoryar
: FAILED open or read
md5sum: WARNING: 2 of 2 listed files could not be read

Root Cause:
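The checksum file was in DOS format (CRLF line endings) because I generated it with md5sum on a Windows desktop. When QA verified it in a Linux environment, md5sum treated the trailing carriage return as part of each file name, so no file could be found. The stray carriage return is also why the error output above looks overwritten.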

Solution:
  1. Generate the md5 file on Linux instead of Windows
  2. Convert the generated md5 file from DOS to UNIX format (using dos2unix or an editor such as UltraEdit)
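
If the checksum file has to be read from our own code, the same problem can be avoided by dropping the trailing carriage return before using the file name. The sketch below is only an illustration: the class name Md5Line is an assumption, and it assumes the usual md5sum layout of 32 hex digits, a space, a mode character (space or asterisk), and then the file name.

```java
// Hypothetical helper: parses one line of an md5sum checksum file.
final class Md5Line {
    final String digest;
    final String fileName;

    private Md5Line(String digest, String fileName) {
        this.digest = digest;
        this.fileName = fileName;
    }

    static Md5Line parse(String rawLine) {
        // Drop a trailing carriage return left over from DOS (CRLF) line endings.
        String line = rawLine.endsWith("\r")
                ? rawLine.substring(0, rawLine.length() - 1)
                : rawLine;
        // Assumed layout: 32 hex digits, a space, a mode character, then the file name.
        if (line.length() < 35) {
            throw new IllegalArgumentException("not an md5sum line: " + line);
        }
        return new Md5Line(line.substring(0, 32), line.substring(34));
    }
}
```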

Thursday, November 18, 2010

CSS Sprites

In the field of Web performance, CSS Sprites is a useful technique that merges small images into a bigger one (also called the master image) to reduce the number of HTTP requests and the download size. There are a lot of discussions and explanations about CSS Sprites, but the root is simple: rely on the CSS background-position property to display part of the bigger image. I don't want to explain the ins and outs of this technique, but want to put some excerpts here for quick reference.

One typical case is building a grid of images and devising a way to get each individual cell out of the grid. We can store all buttons, navigation items, or whatever we wish in a single master image file, along with the associated “before” and “after” link states.

CSS sprites are a way to reduce the number of HTTP requests made for image resources referenced by your site. Images are combined into one larger image at defined X and Y coordinates. Having assigned this generated image to relevant page elements, the background-position CSS property can then be used to shift the visible area to the required component image.

In this case, multiple images used throughout a website would be combined into the so-called “master image.” To display a single image from the master image, one would use the background-position property in CSS, defining the exact position of the image to be displayed. Any hover, active or focus effects would be implemented using the simple definition of the background-position property for the displayed element. CSS sprites are commonly used, particularly for navigation (such as for hover effects), icons and buttons.

CSS sprites group multiple images into one composite image and display them using CSS background positioning. You can save a significant number of HTTP requests by consolidating your images into one or more composite sprites and using CSS to selectively display parts of the sprite within your web page.

Yahoo! uses a number of CSS hacks to allow for different browsers in their code. They also create different style sheets optimized for different user agents, a non-trivial amount of work. The hacks include the backslash hack, underscore hack, voice family hack, and the property:value; hack, which is part of another CSS hack.

The smaller size of sprites is due to the reduced overhead of multiple color tables and formatting information required by separate images. To maximize accessibility and usability, CSS sprites are best used for icons or decorative effects.

There are a bunch of tools for CSS Sprites and image optimization.
SmartSprites http://csssprites.org/
CSS Sprites Generator http://csssprites.com/
SpriteMe http://spriteme.org/

The background-position property sets the starting position of a background image. It is supported in all major browsers. An older reference note said that background-attachment must be set to "fixed" for this to work in Firefox and Opera, but sprites do not depend on that setting. For more information, check out CSS background-position Property
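
The arithmetic behind the grid case is simple enough to sketch. Assuming equally sized cells laid out from the top-left corner of the master image (my assumption, as is the class name SpriteGrid), the background-position value for a cell is just the negated pixel offset of that cell:

```java
// Hypothetical helper: computes the background-position value for one cell of a sprite grid.
final class SpriteGrid {
    // Assumes equally sized cells laid out in a grid starting at the top-left corner.
    static String backgroundPosition(int column, int row, int cellWidth, int cellHeight) {
        // Negative offsets shift the master image so that the wanted cell becomes visible.
        return px(-column * cellWidth) + " " + px(-row * cellHeight);
    }

    private static String px(int value) {
        return value == 0 ? "0" : value + "px";
    }
}
```

For a grid of 16x16 icons, the icon in the third column of the second row would be shown with background-position: -32px -16px.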

Firebug Net Panel Tooltip Quick Introduction

Firebug is a very popular Firefox plugin for Web developers. When we start to look at front-end performance, its built-in Net panel provides a very handy view of the network traffic between the browser and the originating servers at the back end. As of this writing, the latest Firebug version is 1.5.4.

The Software is hard blog provides a lot of detailed information about Firebug, including blog posts and tutorials. I just want to recap what I learned from it and other resources regarding the tooltip timing data.

Here is one example

The tooltip breaks each request into several phases. Here are some notes about each phase.

    * DNS Lookup: DNS resolution time, from domain name to IP. When it is cached, the time is usually zero.
    * Connecting: Elapsed time required to create a TCP connection (RTT?)
    * Blocking (not on the screenshot): Elapsed time spent in a browser queue waiting for a network connection. Displayed only in the case where this situation happens because browsers have max limit for persistent connections per server.
    * Sending: Time needed to send request data to the server (upload time)
    * Waiting: Time spent waiting for the response (until the first byte is received from the server). It is usually the back-end server response time, also called SPL (Server Processing Latency).
    * Receiving: Time necessary to download response body (download time)

For more info, check out this blog (firebug-net-panel-timings) from Software is hard

    * DOMContentLoaded event fires when parsing of the current page is complete. If this number is negative the event was fired before the request is started.
    * Load event fires when all files have finished loading from all resources, including ads and images. If this number is negative the event was fired before the request is started.

Microsoft has a test drive that explains the two events, DOMContentLoaded and Load.

Wednesday, November 17, 2010

Deadlock from sorted IN clause

Recently we saw a deadlock in the Oracle db caused by a simple DELETE with an IN clause (the input parameters are alphabetically sorted). The SQL looks like
DELETE FROM TEST_TABLE WHERE ID IN (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10);

Stack Trace:
--- The error occurred while applying a parameter map.
--- Check the BatchDeleteUserContextByCred-InlineParameterMap.
--- Check the statement (update failed).
--- Cause: com.inet.ora.Ora4SQLException: [Oracle] #600 ORA-00060: deadlock detected while waiting for resource
[Oracle] #600
        at com.ibatis.sqlmap.engine.mapping.statement.GeneralStatement.executeUpdate(GeneralStatement.java:91)
        at com.ibatis.sqlmap.engine.impl.SqlMapExecutorDelegate.update(SqlMapExecutorDelegate.java:505)
        at com.ibatis.sqlmap.engine.impl.SqlMapExecutorDelegate.delete(SqlMapExecutorDelegate.java:528)
        at com.ibatis.sqlmap.engine.impl.SqlMapSessionImpl.delete(SqlMapSessionImpl.java:98)
        at com.ibatis.sqlmap.engine.impl.SqlMapClientImpl.delete(SqlMapClientImpl.java:75)

Issue Analysis:
These "Global Enqueue Services Deadlock detected" messages indicate row-level contention and a locking issue caused by the application code. These errors are exactly the same as those seen in an ORA-00060 deadlock scenario, except that these messages indicate the RAC database has detected the deadlock and forced the rollback of a transaction as a result. There is no Oracle RAC issue here; this needs to be addressed at the application coding level.

Why deadlock:
Simply put, when two or more sessions do the same task at the same time, there will be conflict. Two DELETE statements like the one above have this problem even though their parameters are sorted (there is no official explanation of this, but the guess is that Oracle does a SELECT before the delete, without a guaranteed row order).

Why did we do "batch delete using IN clause"?
  1. One-by-one delete has a performance issue, due to redo/undo log and replication cost
  2. A huge batch delete has issues too: it causes rollback segment or data-file related errors.

Possible solutions:
  1. Move the cleanup task to a DB job
  2. Use JDBC batch delete (multiple deletes with one commit, using addBatch(), executeBatch() etc)
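
A minimal sketch of the JDBC batch delete option, using only the standard java.sql API. The class name BatchDelete, the chunkSize parameter and the single-row DELETE statement are my assumptions for illustration, not the code we actually deployed; TEST_TABLE and ID come from the SQL above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper: deletes ids from TEST_TABLE in chunks, one commit per chunk.
final class BatchDelete {
    static int deleteByIds(Connection conn, List<String> ids, int chunkSize) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        int submitted = 0;
        try (PreparedStatement ps = conn.prepareStatement("DELETE FROM TEST_TABLE WHERE ID = ?")) {
            for (String id : ids) {
                ps.setString(1, id);
                ps.addBatch();
                // Send a full chunk to the database and commit it.
                if (++submitted % chunkSize == 0) {
                    ps.executeBatch();
                    conn.commit();
                }
            }
            // Flush whatever is left in the last, partial chunk.
            ps.executeBatch();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
        return submitted;
    }
}
```

The chunk size keeps each transaction small, which addresses the "huge batch delete" concern, while still avoiding one commit per row.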

Wednesday, November 3, 2010

Facebook Operations (Jonathan Heiliger)

Key takeaways:
  1. One team (the engineering team drives development, testing and deployment)
  2. Small team (several people for constant communication)
  3. Innovation (Hiphop, Scribe, Cassandra etc etc)

Best Practices:
  1. Version control everything
  2. Automate, automate, automate
  3. Plan to fail (constant failure)
  4. Instrument everything
  5. Don't waste time on dumb stuff
  6. Distribute accountability
  7. Focus on top features (e.g. feed)
  8. Monitor usage to understand customer
  9. Loosely coupled system

Friday, October 29, 2010

HTTP 1.1 Status Code Definitions

This is a quick reference to the HTTP/1.1 status code categories, excerpted from http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html, part of RFC 2616.

Informational 1xx (100-101)

This class of status code indicates a provisional response, consisting only of the Status-Line and optional headers, and is terminated by an empty line. There are no required headers for this class of status code. Since HTTP/1.0 did not define any 1xx status codes, servers MUST NOT send a 1xx response to an HTTP/1.0 client except under experimental conditions.

Successful 2xx (200-206)

This class of status code indicates that the client's request was successfully received, understood, and accepted.

Redirection 3xx (300-307)

This class of status code indicates that further action needs to be taken by the user agent in order to fulfill the request. The action required MAY be carried out by the user agent without interaction with the user if and only if the method used in the second request is GET or HEAD. A client SHOULD detect infinite redirection loops, since such loops generate network traffic for each redirection.

Client Error 4xx (400-417)

The 4xx class of status code is intended for cases in which the client seems to have erred. Except when responding to a HEAD request, the server SHOULD include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition. These status codes are applicable to any request method. User agents SHOULD display any included entity to the user.

Server Error 5xx (500-505)

Response status codes beginning with the digit "5" indicate cases in which the server is aware that it has erred or is incapable of performing the request. Except when responding to a HEAD request, the server SHOULD include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition. User agents SHOULD display any included entity to the user. These response codes are applicable to any request method.
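
Since the class is determined by the first digit alone, the five categories above can be sketched as a small lookup. The class name StatusClass is my own; the category names are the ones used in RFC 2616.

```java
// Hypothetical helper: maps an HTTP/1.1 status code to its RFC 2616 class name.
final class StatusClass {
    static String of(int code) {
        // The first digit of the status code defines the class of response.
        switch (code / 100) {
            case 1: return "Informational";
            case 2: return "Successful";
            case 3: return "Redirection";
            case 4: return "Client Error";
            case 5: return "Server Error";
            default: throw new IllegalArgumentException("not an HTTP/1.1 status code: " + code);
        }
    }
}
```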

Health insurance

It is kind of hard to understand all these medical plans without reading their statements carefully. This year the company is changing the plan offerings again to save cost, so next year's per-paycheck contribution will increase by about 40%. Given that, I had to read the flyers carefully and attend one session to understand them better before making elections within the enrollment period.

As I said, all these acronyms are kind of confusing to engineers. However, a 10-minute explanation works well. Here I try to write down my understanding as a memo for future reference. For a medical insurance plan, we need to understand 3 categories, namely health plan, health account, and health service provider.

Health plan:
  1. PPO – Preferred provider organization.  A health plan that uses network and out-of-network providers.  Examples are Choice Plus (UHC) and Open Access Plus (CIGNA).
  2. EPO – Exclusive provider organization.  A health plan that uses network only providers.  Examples are Choice (UHC) and Open Access (CIGNA).
  3. OOA – Out of area plan.  Medical coverage for employees outside major network and metropolitan areas.
  4. HPSP – Health Plus Savings Plan.  Usually tax qualified high deductible health plan.
  5. HMO – Health maintenance organization.  A legally qualified health care organization that provides medical services in a geographic area.  Examples are Kaiser and Harvard Pilgrim Healthcare.
Health account:
  1. FSA - Flexible Spending Account (employee owns, use it or lose it)
  2. LPFSA - Limited Purpose Flexible Spending Account (employee owns, limited to vision and dental)
  3. HSA - Health Savings Account (employee owns, carries over with possible interest)
  4. HIA - Health Incentive Account (company owns, a.k.a. Health Reimbursement Account)
Health service provider:
  1. CIGNA
  2. UHC(UnitedHealthCare)
  3. Kaiser
  4. Harvard Pilgrim
  1. Tax laws prohibit the rollover of HIA funds into an HSA.
  2. New medical reforms require that dependents up to 26 years old be covered in the medical plan, regardless of employment/marital status.

High performance web site - reading notes (4)

11. Avoid Redirects
Redirects hurt performance
Response status code is 3xx for redirects (300-307, and 304 is for conditional Get)
Redirects delay the HTML doc, CSS impacts rendering, JS impacts rendering and parallel downloads
Missing Trailing Slash
    Apache alias/mod_rewrite/DirectorySlash
Connecting Web Sites
Tracking internal traffic - referer logging
Tracking outbound traffic - beacon (http request contains tracking info in the URL)
Prettier URL - Avoid redirect using Alias, mod_rewrite, DirectorySlash and directly linking

12. Remove duplicate scripts
Unnecessary HTTP requests
Wasted JS execution
Implement a script management module in templating system
Script has a getVersion() function

13. Configure (Avoid) ETags
Entity tags - a mechanism that web servers and browsers use to validate cached components
ETag is a string, must be quoted, introduced in HTTP/1.1
If-None-Match takes precedence over If-Modified-Since
ETag is typically constructed using attributes that make them unique to a specific server hosting a Web site
Apache ETag uses inode-size-timestamp, and the FileETag directive can remove inode
IIS ETag uses Filetimestamp:ChangeNumber (# of configuration changes to IIS)
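
To make the precedence note in rule 13 concrete, here is a simplified sketch of the server-side decision for a conditional GET. The class and parameter names are assumptions, and it deliberately ignores details such as "*", lists of tags and weak validators, so treat it as an illustration of the precedence only.

```java
// Hypothetical helper: decides whether a conditional GET may be answered with 304 Not Modified.
final class ConditionalGet {
    static boolean notModified(String ifNoneMatch, Long ifModifiedSinceMillis,
                               String currentEtag, long lastModifiedMillis) {
        if (ifNoneMatch != null) {
            // If-None-Match takes precedence: If-Modified-Since is not consulted at all.
            return ifNoneMatch.equals(currentEtag);
        }
        // Fall back to the date check only when no entity tag was sent.
        return ifModifiedSinceMillis != null && lastModifiedMillis <= ifModifiedSinceMillis;
    }
}
```

This also shows why server-specific ETags hurt in a cluster: if the next request lands on a server that builds a different ETag for the same file, the first branch fails and the component is downloaded again even though it has not changed.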

14. Make ajax cacheable
Web2.0, DHTML, Ajax
Yahoo Mail caches ajax result
Use packet sniffer to monitor active/passive Ajax requests

Tuesday, October 26, 2010

High performance web site - reading notes (3)

6. Put (java)scripts at the bottom
Parallel downloads
Limiting parallel downloads to two per hostname is a guideline, new browsers expand to 4 or more for HTTP/1.1
Scripts block download
Use deferred scripts (DEFER attribute indicates the script does not contain document.write)

7. Avoid CSS Expressions
CSS expressions are a powerful and dangerous way to set CSS properties dynamically
CSS expressions are evaluated more frequently than most people expect.
One-Time Expressions
Event handlers

8. Make javascript and CSS external
In raw terms, inline is faster, but we need consider three metrics (page views, empty cache vs.primed cache, and component reuse).
post-onload download (document onload event,firebug highlights DOMContentLoaded, load events)

9. Reduce DNS lookup
Reducing the number of unique hostnames reduces the number of DNS lookups
Reducing the number of unique hostnames also reduces the amount of parallel downloading
Use keep-alive to reuse an existing connection, avoiding TCP/IP overhead

10. Minify javascript
Use minification instead of obfuscation (due to bugs, maintenance, debugging etc concerns)
Minify javascript using JSMin or dojo compressor (shrinksafe)

High performance web site - reading notes (2)

1. Make Fewer HTTP Requests
Image maps
CSS Sprites
Inline images (data: URL scheme)
    e.g. <img alt="red star" src="data:image/gif;base64,THE-BASE64-DATA-OF-IMAGE">
Combined javascripts and stylesheets
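
As a small illustration of the inline image note in rule 1, a data: URL can be built from the image bytes with the standard java.util.Base64 encoder. The class name DataUri is my own, not something from the book.

```java
import java.util.Base64;

// Hypothetical helper: builds a data: URL that can be used directly as an img src.
final class DataUri {
    static String of(String mimeType, byte[] content) {
        // Format: data:<mime type>;base64,<base64 of the raw bytes>
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }
}
```

The result saves an HTTP request, but the image is no longer cached separately from the page that embeds it.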

2. Use a Content Delivery Network
Mirror Image
SAVVIS (specialized in video content delivery)
Use keynote.com or gomez.com to test geographic locations

3. Add an Expires header
Cache-Control (max-age) which take precedence over Expires
Apache mod_expires
Empty cache vs. primed cache
revving filenames (add build version number), don't use query string

4. Gzip components
Accept-Encoding (Content-Encoding in response)
Image/PDF should not be gzipped (Gzip your scripts and stylesheets)
Gzip reduces the response size by about 70%
Apache mod_gzip (mod_deflate)
Proxy caching uses Vary header (e.g. Vary: Accept-Encoding,User-Agent)
Update (5/17/2012)
Compress the Embedded OpenType font files used by Internet Explorer. EOT is a binary format, but it is not natively compressed
Compress the favicon: although it is an image file, it is not natively compressed
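
To get a feel for the savings in rule 4 on our own text components, a quick check with the JDK's GZIPOutputStream is enough. This is only a measuring sketch with an assumed class name (GzipSize); the actual compression in production is done by the web server module.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: reports how many bytes a payload occupies after gzip.
final class GzipSize {
    static int compressedLength(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the gzip stream writes the trailer, so the size is read afterwards.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }
}
```

Running it against a stylesheet or script and against a JPEG shows the difference quickly: text shrinks a lot, already compressed images barely change.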

5. Put stylesheets at the top
Use Link instead of @import as @import rule causes unexpected ordering in how the components are downloaded
FOUC = Flash of unstyled content
Put stylesheets in the document HEAD using the LINK tag

High performance web site - reading notes (1)

Somehow I was assigned a new task to investigate front-end performance for an important project, mostly about web site performance. This is a hot topic in the Web 2.0 era, and I did join the Velocity 2010 conference this June in Santa Clara. Most sessions were about Web site performance and, behind the scenes, how to make web pages faster while dealing with old friends such as HTML/JS/CSS/images/Flash. However, I have not worked on this layer for years and have almost forgotten how to write CSS/JS efficiently, so I need to pick it up quickly by reading.

Why High performance web site?
Two main reasons: Steve Souders is the author of YSlow and was once Chief Performance Yahoo!, leading a team focused on Yahoo performance (Yahoo also published the best practices), and he is now with Google working on performance. The book is also well organized and easy to read and understand. I am preparing to read his second book, "Even Faster Web Sites", now.

Top 14 rules:
There are many rules (best practices) regarding Web site performance from Yahoo, Google or other companies. But in this book, Steve lists the top 14 rules and explains the ins and outs of these rules with examples and case studies. I will not repeat his points word by word here, but as reading notes, I will write down the key takeaways from each rule. Therefore, the notes might not be complete sentences, might lack context, or might be hard to fully understand. If you are interested, get a copy and read Steve's original words.
  1. Make fewer HTTP requests
  2. Use a CDN
  3. Add an Expires header
  4. Gzip components
  5. Put stylesheets at the top
  6. Put (java)scripts at the bottom
  7. Avoid CSS Expressions
  8. Make javascript and CSS external
  9. Reduce DNS lookup
  10. Minify javascript
  11. Avoid Redirects
  12. Remove duplicate scripts
  13. Configure (avoid) ETags
  14. Make ajax cacheable

Wednesday, October 20, 2010


java.lang.NoClassDefFoundError is an error we sometimes encounter in the test environment. According to the JDK description (since JDK 1.0), the error is thrown if the Java Virtual Machine or a ClassLoader instance tries to load in the definition of a class (as part of a normal method call or as part of creating a new instance using the new expression) and no definition of the class could be found. The searched-for class definition existed when the currently executing class was compiled, but the definition can no longer be found at runtime.

Here are some cases of this error:
Case 1:
java.lang.NoClassDefFoundError: org/apache/log4j/Category) (Caused by org.apache.commons.logging.LogConfigurationException: No suitable Log constructor [Ljava.lang.Class;@2c773f1 for org.apache.commons.logging.impl.Log4JLogger (Caused by java.lang.NoClassDefFoundError: org/apache/log4j/Category))

Possible root cause: the WebLogic cache somehow missed the class definition but didn't try to load it again.

The fix: a refresh is needed to reload the classes.
1.       Delete all directories named .wlnotdelete under beahome/user_projects (or its sub-directories)
2.       Delete all files in the upload directory under beahome/user_projects (or its sub-directories)
3.       Restart WLS and redeploy the package

Case 2:
java.lang.NoClassDefFoundError: Could not initialize class com.company.webapp.module.search.Proxy
Root cause: creating a new instance depends on the service manager for the search engine endpoint look-up. When the service manager is down, the constructor throws an exception. 
The fix is to provide a default value or not to throw an exception in the constructor.
Case 3:
java.lang.NoClassDefFoundError: Could not initialize class com.company.webapp.module.search.PathCache$PathCacheHolder com.company.webapp.module.search.PathCache.getInstance(PathCache.java:139)

Root cause: in the preload servlet, the init() method needs to call the service manager (another application on the same JVM) for the component endpoint look-up. This delays Tomcat opening its ports, because Tomcat opens ports only after all servlets/listeners from the web apps are initialized; in this case, the preload servlet is still initializing while waiting for the service manager. 
The fix is to call the service manager from another thread in the init() method, with a 30-second wait time, to speed up Tomcat startup. With that, the preload servlet gets initialized very fast, and Tomcat opens its ports after the application (with the preload servlet) and the service manager (with other lightweight servlets) are ready in the same Tomcat.
Case 4:
javax.servlet.ServletException: Servlet execution threw an exception
java.lang.NoClassDefFoundError: Could not initialize class

Possible root cause: the Emma build somehow caused a runtime issue though the compilation was OK. We once met a similar issue (an Emma build caused a SecurityException). The reasoning was that there was no change to the application filters between the 2 daily builds, so we were confident the error was caused not by a code change but by the daily build. 
The fix was to use the previous build or a non-Emma build.
To sum up:
NoClassDefFoundError is an Error and is usually caused by coding practice. However, sometimes an environment issue (like a build or application server bug) might also cause this error. The way to fix it is to analyze the error log for the root cause, then figure out a solution. One guideline is to ensure the jar or class is on the classpath, and that the constructor (class), init (servlet) or getInstance (singleton class) doesn't depend on other resources.
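
The "Could not initialize class" variant seen in Cases 2 to 4 is easy to reproduce. In the sketch below (all names are made up for illustration), the static initializer of a holder class fails the way a look-up against a stopped service would. The first access reports ExceptionInInitializerError; every later access reports NoClassDefFoundError, which is usually the only one left in the log by the time somebody looks.

```java
// Hypothetical demo: a failing static initializer turns into NoClassDefFoundError on later use.
final class InitFailureDemo {
    // Stands in for a holder class that looks up an endpoint during class initialization.
    static final class Holder {
        static final String ENDPOINT = lookup();

        private static String lookup() {
            throw new IllegalStateException("service manager is down");
        }
    }

    // Touches Holder and returns the name of the Throwable type the JVM raised.
    static String touch() {
        try {
            return Holder.ENDPOINT;
        } catch (Throwable t) {
            return t.getClass().getName();
        }
    }
}
```

Calling InitFailureDemo.touch() twice returns java.lang.ExceptionInInitializerError first and java.lang.NoClassDefFoundError second, so the real cause has to be searched for earlier in the log.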

Tuesday, October 5, 2010

Groupon and the business model

Groupon (www.groupon.com) enables a group of people to purchase a deal online as a group, with coupon sharing and purchasing features, categorized by city to gather buyers more easily. There are many similar promotion sites sharing a similar business model. Merchants pay service fees (or commission) to Groupon; Groupon publishes the daily deal news and sells the coupons; and buyers redeem the coupons (printed or on mobile) at the merchant's store. This is a win-win strategy for both seller and buyer, and Groupon makes its profit via commission (or something like that). It is a very good business model, very similar to a real-world marketplace, which provides the platform/location for both seller and buyer and collects the rent. More and more features can be added to the Groupon site, but the essential value is the win-win situation: sellers can sell more and advertise themselves with promotions, and buyers can get great deals in local stores.

Similar Sites:

Traditional Coupon Site:
http://dealsea.com/  - I guess it profits from links, similar to Google AdSense
http://www.couponsherpa.com/ - this site is well organized, and it is easy to find coupons for big merchants

Notes from Java security training

Threat Modeling:
Use the Microsoft SDL Threat Modeling Tool to understand the potential threats to the system. Usually there are 4 steps, namely (1) Draw Diagrams (2) Analyze Model (3) Describe Environment (4) Generate Reports. We can focus on outward-facing interfaces first, then features, and define processes, data stores, external interactors and data flows as well as (trust) boundaries.

Here is one *.tms file snapshot:

Secure Development Life-cycle:
An organization or project group should define a secure development process to build security into every software development phase, from requirement analysis through design, development, deployment and so on.

Use JTest to fix insecure code:
Eclipse with JTest plugin provides better experience to help write secure code.

OWASP WebGoat Example:

  1. SDL Threat Modeling Tool needs Visio
  2. JTest is from Parasoft, which provides a bunch of features including security scan

Thursday, September 30, 2010


JCR Data Model


Repository -> Workspace -> Item (Node/Property)

Why PathNotFoundException?
From javadoc: Exception thrown when no Item exists at the specified path or when the specified path implies intermediary Nodes that do not exist.

From JCR troubleshooting topic: troubleshooting common exceptions: Within a workspace, each node is identified by its path and its uuid. In order to retrieve a node you must specify either its path or its uuid. The retrieval methods within the repository throw the exceptions "PathNotFoundException" and "ItemNotFoundException" to indicate that the node or property that you attempted to retrieve (either by path or uuid) does not exist in the current workspace.

Handle it properly; it is effectively a warning.

Thursday, September 23, 2010

cannot open '/etc/mail/local-host-names': World writable directory

What is World-writable?
A world-writable file or directory is one that anyone can alter. For directories this means that files can be created or deleted within it by anyone. Being world-writable is almost always a security risk.
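
As a rough illustration of the definition above, the following Java sketch tests whether the "others" write bit is set on a path. The class and method names are made up for this example, and it assumes a POSIX file system (Linux, macOS) and Java 7 or later.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.util.Set;

/** Reports whether a path is world-writable (hypothetical helper for illustration). */
final class WorldWritableCheck {

    /** Returns true if the "others" write bit is set on the given path. */
    static boolean isWorldWritable(Path path) throws IOException {
        // Needs a POSIX file system; throws UnsupportedOperationException elsewhere.
        Set<PosixFilePermission> perms = Files.getPosixFilePermissions(path);
        return perms.contains(PosixFilePermission.OTHERS_WRITE);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java WorldWritableCheck <path> [<path> ...]");
            return;
        }
        for (String arg : args) {
            Path p = Paths.get(arg);
            System.out.println(p + " world-writable: " + isWorldWritable(p));
        }
    }
}
```

Running it against a directory and each of its parent directories shows which level carries the world-writable bit.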

I got the error below when trying to start sendmail (/etc/init.d/sendmail start):
/etc/mail/sendmail.cf: line 94: fileclass: cannot open '/etc/mail/local-host-names': World writable directory

  1. Granting write permission on /etc, /etc/mail and /etc/mail/* did not work
  2. Manually editing the file sendmail.cf worked:
replace line Fw/etc/mail/local-host-names with line
Fw-o /etc/mail/local-host-names

Similarly (if you get the same error for trusted-users),
replace line Ft/etc/mail/trusted-users with line
Ft-o /etc/mail/trusted-users

Shell script to monitor disk space

Last week a full disk caused database transactions to hang, and this week it caused javax.jms.JMSException: No space left on device, so ActiveMQ could not write messages to its queue. I decided to write a shell script to watch the disk space.

  1. Google for similar script and info
  2. Write the shell script 
  3. Start sendmail for email alert
  4. Add script to cronjob for auto monitoring
Init scripts used for steps 3 and 4:
  1. /etc/init.d/sendmail => Usage: /etc/init.d/sendmail {start|stop|restart|condrestart|status}
  2. /etc/init.d/crond => Usage: /etc/init.d/crond {start|stop|status|reload|restart|condrestart}
Two options to add script to cronjob:
  1. crontab -e
  2. Put shell script to /etc/cron.* folder (cron.daily/   cron.hourly/  cron.monthly/ cron.weekly/)
#!/bin/sh
# http://www.cyberciti.biz/tips/shell-script-to-watch-the-disk-space.html
# set admin email for alert (placeholder; the original value is not shown)
ADMIN_EMAIL="admin@example.com"

# set alert level in percent (placeholder value)
ALERT_LEVEL=90
# set log folder to get deleted (placeholder path)
LOG_FOLDER="/path/to/logs"

# one line per file system: "<use%> <filesystem>"
df -HP | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{ print $5 " " $1 }' | while read output; do
  echo "$output"
  used=$(echo "$output" | awk '{ print $1 }' | cut -d'%' -f1)
  partition=$(echo "$output" | awk '{ print $2 }')
  if [ "$used" -ge "$ALERT_LEVEL" ]; then
    echo "Running out of space \"$partition ($used%)\" on $(hostname) as on $(date)" |
      mail -s "Alert: Running out of disk space - $used% used" "$ADMIN_EMAIL"
    # purge log files older than 2 days
    echo "Purging log files ..."
    find "$LOG_FOLDER" -mtime +2 -exec rm -f {} \;
    echo "DONE"
  fi
done
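
The same threshold test can also be done from inside a Java application, for example before writing to a queue directory. The sketch below is only an illustration: the class name and the 90% threshold are assumptions, and the percentage may differ slightly from the Use% column of df, which is computed and rounded differently.

```java
import java.io.File;

/** Checks how full a partition is (hypothetical helper, not part of the script). */
final class DiskSpaceCheck {

    /** Percentage of the partition in use, from total and usable bytes. */
    static int usedPercent(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return 0; // unknown or empty partition: report nothing used
        }
        return (int) Math.round(100.0 * (totalBytes - usableBytes) / totalBytes);
    }

    /** True when usage has reached the alert level (same >= test as the script). */
    static boolean needsAlert(File partition, int alertLevelPercent) {
        int used = usedPercent(partition.getTotalSpace(), partition.getUsableSpace());
        return used >= alertLevelPercent;
    }

    public static void main(String[] args) {
        File partition = new File(args.length > 0 ? args[0] : "/");
        int alertLevel = 90; // assumed threshold, adjust as needed
        int used = usedPercent(partition.getTotalSpace(), partition.getUsableSpace());
        System.out.println(partition + " is " + used + "% used, alert: "
                + needsAlert(partition, alertLevel));
    }
}
```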


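Why use df -HP?
The -P option (POSIX output format) keeps each file system on a single line. Without it, df wraps a long device name onto a line of its own, which shifts the columns that awk reads. Compare the /jdrive entry in the outputs below.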

linux01:root > df -H
Filesystem             Size   Used  Avail Use% Mounted on
/dev/sda2              8.0G   2.7G   5.0G  35% /
                       289G   203G    78G  73% /jdrive

linux01:root > df -H -t ext3
Filesystem             Size   Used  Avail Use% Mounted on
/dev/sda2              8.0G   2.7G   5.0G  35% /

linux01:root > df -HP
Filesystem             Size   Used  Avail Use% Mounted on
/dev/sda2              8.0G   2.7G   5.0G  35% /
(device name omitted)  289G   203G    78G  73% /jdrive


Monday, September 20, 2010

Emma build caused SecurityException

Web application throws below exception:

java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
at sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:221)
at sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:176)
at java.util.jar.JarVerifier.processEntry(JarVerifier.java:277)
at java.util.jar.JarVerifier.update(JarVerifier.java:188)
at java.util.jar.JarFile.initializeVerifier(JarFile.java:321)
at java.util.jar.JarFile.getInputStream(JarFile.java:386)
at org.apache.catalina.loader.WebappClassLoader.findResourceInternal(WebappClassLoader.java:2018)
at org.apache.catalina.loader.WebappClassLoader.findClassInternal(WebappClassLoader.java:1786)
at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:890)
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1354)
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1233)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)

It was caused by the Emma build. I am not sure of the exact root cause, but the standard (non-Emma) build does not have this issue.

  1. This exception usually means that a signed archive is corrupted, that is, its contents no longer match its signature files
  2. Emma is an open source Java code coverage tool: http://emma.sourceforge.net/
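
The root cause was not confirmed here. One possibility that fits the error message is that the build changed or repackaged a signed JAR, so that the signature files left in META-INF no longer match. Under that assumption, the sketch below lists the signature-related entries of a JAR so that suspect archives can be found. The class and method names are invented for this example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Lists signature-related entries in a JAR (hypothetical diagnostic helper). */
final class SignatureFileLister {

    /** True for entries directly under META-INF ending in .SF, .DSA, .RSA or .EC. */
    static boolean isSignatureEntry(String entryName) {
        String name = entryName.toUpperCase(Locale.ROOT);
        String prefix = "META-INF/";
        if (!name.startsWith(prefix)) {
            return false;
        }
        if (name.indexOf('/', prefix.length()) >= 0) {
            return false; // nested below META-INF, not a signature file
        }
        return name.endsWith(".SF") || name.endsWith(".DSA")
                || name.endsWith(".RSA") || name.endsWith(".EC");
    }

    /** Reads the archive as a plain zip, so no signature verification is triggered. */
    static List<String> listSignatureEntries(String jarPath) throws IOException {
        List<String> found = new ArrayList<>();
        try (ZipFile zip = new ZipFile(jarPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (isSignatureEntry(name)) {
                    found.add(name);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java SignatureFileLister <jar> [<jar> ...]");
            return;
        }
        for (String jar : args) {
            System.out.println(jar + " -> " + listSignatureEntries(jar));
        }
    }
}
```

A JAR that prints a non-empty list is signed; if the build touches its classes, that JAR is a candidate for the exception above.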

Friday, September 17, 2010

Tomcat is not responding due to ORA-00257

The web team reported that requests to Tomcat suddenly stopped getting responses. The environment is an isolated one with access control, so there had been no recent system or application change.

Application Log:
Caused by: com.inet.ora.Ora4SQLException: [Oracle] #1 ORA-00257: archiver error. Connect internal only, until freed.
[Oracle] #1
    at com.inet.ora.i.a(Unknown Source)
    at com.inet.ora.OraConnection.a(Unknown Source)
    at com.inet.ora.OraConnection.q(Unknown Source)
    at com.inet.ora.OraConnection.<init>(Unknown Source)
    at com.inet.ora.l.<init>(Unknown Source)
    at com.inet.ora.i.a(Unknown Source)
    at com.inet.ora.OraDriver.connect(Unknown Source)
    at org.apache.commons.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38)
    at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:294)
    at org.apache.commons.dbcp.BasicDataSource.validateConnectionFactory(BasicDataSource.java:1247)
    at org.apache.commons.dbcp.BasicDataSource.createDataSource(BasicDataSource.java:1221)
    ... 46 more

Thread Dump:
"http-8080-12" daemon prio=10 tid=0x0000000046b20c00 nid=0x18e0 runnable [0x000000004954e000..0x000000004954fb90]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at com.inet.ora.OraConnection.i(Unknown Source)
    at com.inet.ora.OraConnection.a(Unknown Source)
    - locked <0x00002aab89aad9d0> (a com.inet.ora.l)
    at com.inet.ora.ax.a(Unknown Source)
    at com.inet.ora.ax.c(Unknown Source)
    at com.inet.ora.ax.execute(Unknown Source)
    at org.apache.commons.dbcp.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:169)
    at com.ibatis.sqlmap.engine.execution.SqlExecutor.executeUpdate(SqlExecutor.java:81)
    at com.ibatis.sqlmap.engine.mapping.statement.GeneralStatement.sqlExecuteUpdate(GeneralStatement.java:200)
    at com.ibatis.sqlmap.engine.mapping.statement.GeneralStatement.executeUpdate(GeneralStatement.java:78)
    at com.ibatis.sqlmap.engine.impl.SqlMapExecutorDelegate.update(SqlMapExecutorDelegate.java:505)
    at com.ibatis.sqlmap.engine.impl.SqlMapSessionImpl.update(SqlMapSessionImpl.java:90)
    at com.ibatis.sqlmap.engine.impl.SqlMapClientImpl.update(SqlMapClientImpl.java:67)

Root Cause:
Oracle archived redo logs used up the disk space. Linux 'df' shows the disk is 100% used.

As the error message says, the fix is to free space. We quickly moved the archived log files to another disk. Meanwhile, we are discussing 2 longer-term options: either switch the database to NOARCHIVELOG mode, or write a script that automatically gzips the archived logs and moves them to another disk.
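
The second option could look roughly like the sketch below, written in Java rather than as a shell script. It compresses one file into a target directory and deletes the original only after compression succeeds. The class name, method name and usage are assumptions for illustration, and whether the database and its backup tooling tolerate archived logs being moved this way has to be checked separately.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

/** Gzips a file into another directory and removes the original (illustrative sketch). */
final class ArchiveMover {

    /** Writes targetDir/<name>.gz, deletes source, and returns the new file. */
    static Path gzipAndMove(Path source, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(source.getFileName().toString() + ".gz");
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        // Reached only if compression finished without an exception.
        Files.delete(source);
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java ArchiveMover <file> <targetDir>");
            return;
        }
        System.out.println("wrote " + gzipAndMove(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```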

Eclipse: Class file name must end with .class

Today, after I switched to another workspace to work on another CVS branch, Java search failed with the error "Class file name must end with .class" when I pressed Ctrl+Shift+G (find references in workspace).

Options I tried:
  1. Clean all projects and rebuild the workspace (did not work)
  2. Close the projects and reopen them (did not work)
  3. Close Eclipse and start it again (did not work)
  4. Add -clean to the Eclipse startup command (e.g. C:\Workshop\eclipse\eclipse.exe  -clean), and this worked!
Another option I didn't try:
     a.  Close Eclipse
     b.  Delete <workspace>/.metadata/.plugins/org.eclipse.jdt.core/*.index
     c.  Delete <workspace>/.metadata/.plugins/org.eclipse.jdt.core/savedIndexNames.txt
     d.  Start Eclipse again 

See http://stackoverflow.com/questions/6816/eclipse-class-file-name-must-end-with-class-exception-in-java-search

Root Cause:
Unknown, but most likely it is an Eclipse cache issue. My Eclipse version is as below
Eclipse Platform
Version: 3.4.1
Build id: M20080911-1700

Thursday, September 16, 2010

Oracle redo log slows down application

A J2EE application on a Tomcat server suddenly showed high latency and very slow server-side processing, and all 150 threads were used up. A regular request took more than 30 seconds, while it normally takes around 100ms.

Production Info:
  1. SW: Oracle 10g RAC, Tomcat6.0, JDK1.6, CentOS4.4
  2. No production outage or HA failover/failback
  3. No stress test or peak load
Root Cause:
The NFS mount point hung, which slowed the archiving of redo logs to that mount point. The redo logs were not archived fast enough, and that caused the latency.

Tuesday, September 7, 2010

Suppress Findbugs warnings

Last week a colleague sent me an email asking me to confirm a false warning from Findbugs. He explained that the method is a Spring Framework init method, so it has to "Write to static field from instance method". As we know, this is bad practice when multiple instances are involved, because each instance overwrites the shared field. But here we need to suppress the warning.

    public void setApplicationContext(ApplicationContext applicationContext) {
        context = (ApplicationContext) applicationContext;
    }
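
To see why the pattern is flagged in general, here is a minimal, Spring-free sketch. The Holder class is invented for this example and simply stands in for a bean that stores an injected value in a static field.

```java
/** Shows how a static field written from an instance method is shared by all instances. */
final class StaticWriteDemo {

    /** Hypothetical stand-in for a bean with a static field set by an instance setter. */
    static final class Holder {
        private static String context; // one value for every Holder instance

        void setContext(String value) {
            context = value; // the pattern Findbugs reports
        }

        static String getContext() {
            return context;
        }
    }

    public static void main(String[] args) {
        new Holder().setContext("first");
        new Holder().setContext("second"); // silently replaces "first"
        System.out.println(Holder.getContext()); // prints "second"
    }
}
```

With a singleton bean only one instance ever calls the setter, which is why the warning can be treated as a false positive in the Spring case.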

Findbugs, PMD, Checkstyle, JTest, etc. are popular static code analysis tools for Java. Findbugs uses pluggable bug detectors to analyze Java bytecode (compiled class files) against known bug patterns. It is itself built on open source libraries such as BCEL, ASM and dom4j.

Installing Findbugs into the Eclipse IDE is straightforward and works like any other Eclipse plugin (Help -> Software Updates... -> Available Software -> Add Site using http://findbugs.cs.umd.edu/eclipse/). I used Findbugs-1.3.9 and JDK1.6 to test the suppression code.

Run Findbugs:
After installing the plugin and restarting Eclipse, right-click in Package Explorer to find the "Find Bugs" menu. Findbugs can also be configured to run automatically.

Suppress Findbugs warning:
PMD leverages the @java.lang.SuppressWarnings annotation and JTest uses predefined comments, but Findbugs defines its own annotation in the edu.umd.cs.findbugs.annotations package, so we need to use the annotation below

    @edu.umd.cs.findbugs.annotations.SuppressWarnings(value =
        "ST_WRITE_TO_STATIC_FROM_INSTANCE_METHOD", justification = "Spring init method")
    public void setApplicationContext(ApplicationContext applicationContext) {
        context = (ApplicationContext) applicationContext;
    }

Or import the annotation first:

    // Note: this import shadows java.lang.SuppressWarnings in this file.
    import edu.umd.cs.findbugs.annotations.SuppressWarnings;

    @SuppressWarnings(value =
        "ST_WRITE_TO_STATIC_FROM_INSTANCE_METHOD", justification = "Spring init method")
    public void setApplicationContext(ApplicationContext applicationContext) {
        context = (ApplicationContext) applicationContext;
    }


Redirects handling in httpclient

Httpclient cannot get the correct response from the production server (http://production.domain.com/servlet.do)

Root cause:
In production, the load balancer redirects http requests to https with a 302 status code. HttpClient does not follow redirects automatically for entity enclosing methods (POST and PUT), so the 302 response is returned to the caller instead of the expected content.

Common redirect status codes:
    * 301 Moved Permanently. HttpStatus.SC_MOVED_PERMANENTLY
    * 302 Moved Temporarily. HttpStatus.SC_MOVED_TEMPORARILY
    * 303 See Other. HttpStatus.SC_SEE_OTHER
    * 307 Temporary Redirect. HttpStatus.SC_TEMPORARY_REDIRECT

Other 3xx status codes, which are not simple redirects:
    * 300 Multiple Choices. HttpStatus.SC_MULTIPLE_CHOICES
    * 304 Not Modified. HttpStatus.SC_NOT_MODIFIED
    * 305 Use Proxy. HttpStatus.SC_USE_PROXY

Handle redirects manually in code for entity enclosing methods.

Sample Code:
    // Create an instance of HttpClient.
    HttpClient client = new HttpClient();
    // Create a method instance.
    //HttpMethod method = new GetMethod(url); //good with redirects unless setFollowRedirects(false)
    PostMethod method = new PostMethod(url); //entity enclosing methods
    // Provide a custom retry handler if necessary
    method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
            new DefaultHttpMethodRetryHandler(3, false));
    try {
      // Execute the method.
      int statusCode = client.executeMethod(method);
      if (statusCode != HttpStatus.SC_OK) {
          // Not OK: if the server sent a Location header, repeat the POST against the new URL.
          Header locationHeader = method.getResponseHeader("location");
          if (locationHeader != null) {
              method.releaseConnection(); // free the connection used by the first request
              method = new PostMethod(locationHeader.getValue());
              statusCode = client.executeMethod(method);
          }
      }

      // Read the response body.
      byte[] responseBody = method.getResponseBody();

      // Deal with the response.
      // Use caution: ensure correct character encoding and is not binary data
      System.out.println(new String(responseBody));

    } catch (HttpException e) {
      System.err.println("Fatal protocol violation: " + e.getMessage());
    } catch (IOException e) {
      System.err.println("Fatal transport error: " + e.getMessage());
    } finally {
      // Release the connection.
      method.releaseConnection();
    }
  1. The HTTP specification defines two entity enclosing methods, POST and PUT.
  2. Redirects are handled automatically by httpclient for the GET method, unless setFollowRedirects(false) is called.
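
When following a redirect by hand, two small decisions come up: whether the status code is one of the simple redirect codes listed above, and how to turn the Location header into a full URL, since some servers send a relative value. The sketch below covers both with the JDK only. RedirectHelper and its method names are made up for this example and are not part of HttpClient.

```java
import java.net.URI;

/** Helpers for manual redirect handling (illustrative names, JDK only). */
final class RedirectHelper {

    /** True for the simple redirect codes 301, 302, 303 and 307. */
    static boolean isRedirect(int statusCode) {
        return statusCode == 301 || statusCode == 302
                || statusCode == 303 || statusCode == 307;
    }

    /** Resolves a Location header value, absolute or relative, against the request URL. */
    static String resolveLocation(String requestUrl, String location) {
        return URI.create(requestUrl).resolve(location).toString();
    }

    public static void main(String[] args) {
        String requestUrl = "http://production.domain.com/servlet.do";
        // An absolute Location value is returned unchanged.
        System.out.println(resolveLocation(requestUrl, "https://production.domain.com/servlet.do"));
        // A relative Location value is resolved against the request URL.
        System.out.println(resolveLocation(requestUrl, "/login.do"));
    }
}
```

The resolved string can then be passed to a new PostMethod in place of the raw header value.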
Httpclient is a popular HTTP/1.1 compliant client implementation, open sourced at Apache. There are 2 main versions now: the legacy commons httpclient (v3.1), and the new HttpComponents (v4.x), the successor of and replacement for commons-httpclient.

commons-httpclient-3.1.jar (org.apache.commons.httpclient)
commons-codec-1.4.jar (org.apache.commons.codec)

httpclient-4.0.1.jar (org.apache.http)
httpcore-4.0.1.jar (org.apache.http)


Friday, September 3, 2010

Eclipse is running in a JRE, but a JDK is required

After I installed the m2eclipse plugin, every time I launch the Eclipse IDE I get an alert saying "Eclipse is running in a JRE, but a JDK is required".

Root Cause:
Maven needs some jar files from JDK distribution.

Add the -vm option to eclipse.ini, before the -vmargs option.

Here is the relevant part of my eclipse.ini (Windows OS, Eclipse 3.4.1)
-vm
C:\Program Files\Java\jdk1.6.0_20\jre\bin\client\jvm.dll

  1. The -vm option must come before the -vmargs, since everything after -vmargs is passed to the JVM instead of Eclipse
  2. The -vm option and its value (the path to the JVM) must be on separate lines.
  3. According to the Eclipse wiki page, the path to the JVM can point to java.exe, javaw.exe or jvm.dll; in my test, jvm.dll works.
  4. If a path containing white space (e.g. C:\Program Files) does not work, try double-quoting the path.
  5. I also set the -vmargs option on the Eclipse startup command line (e.g. C:\Workshop\eclipse\eclipse.exe  -clean -showlocation WAPI2.0 -refresh -vmargs -Xms256M -Xmx384M)