In the next few years, applications will run on petascale systems with tens to hundreds of thousands of processors, hundreds of I/O nodes, and thousands of disks. This two-orders-of-magnitude leap in scale over today's typical systems is opening a critical gap in fault management. Currently, the system software components for large-scale machines remain largely independent in their fault awareness and notification strategies. This talk will describe the concept of holistic fault tolerance, which takes into account the full impact of faults at all levels: the hardware, OS, middleware, and application. Such integration will make possible a level of fault prediction, notification, management, and recovery that is impossible today but critical to the productive use of the petascale systems of tomorrow. From the application viewpoint, as the size of the computer increases, the mean time between failures (MTBF) decreases while the time to write out all of memory (to checkpoint) increases. Thus, at some point the time needed to write a checkpoint approaches the MTBF, and checkpoint/restart is no longer a viable way to handle faults. New algorithms may be needed. This talk will also describe the development of several new algorithms that have "Natural Fault Tolerance" and describe their behavior on a 100,000-processor simulator we have built to test the scalability and fault tolerance of petascale applications.
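The checkpoint/restart argument can be illustrated with a rough back-of-the-envelope model. The sketch below is not taken from the talk. It assumes independent node failures (so system MTBF is node MTBF divided by node count), a checkpoint time equal to total memory divided by a fixed aggregate I/O bandwidth, and Young's first-order approximation for the checkpoint interval, sqrt(2 * checkpoint time * MTBF). All function names and parameter values (50,000-hour node MTBF, 2 GB per node, 50 GB/s of I/O bandwidth) are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical parameters, chosen only for illustration (not from the talk).
NODE_MTBF_HOURS = 50_000.0      # assumed MTBF of a single node
MEM_PER_NODE_GB = 2.0           # assumed memory per node
IO_BANDWIDTH_GB_PER_S = 50.0    # assumed aggregate I/O bandwidth, held fixed


def system_mtbf_hours(node_mtbf_hours, n_nodes):
    """System MTBF, assuming independent node failures."""
    return node_mtbf_hours / n_nodes


def checkpoint_hours(n_nodes, mem_per_node_gb, io_bandwidth_gb_per_s):
    """Time to write all memory to disk, in hours."""
    seconds = n_nodes * mem_per_node_gb / io_bandwidth_gb_per_s
    return seconds / 3600.0


def young_interval_hours(ckpt_hours, mtbf_hours):
    """Young's first-order approximation of the checkpoint interval."""
    return math.sqrt(2.0 * ckpt_hours * mtbf_hours)


def overhead_fraction(n_nodes):
    """Estimated fraction of machine time lost to checkpoints and rework.

    Uses the first-order estimate C/tau + tau/(2*M), capped at 1.0.
    The estimate is only meaningful while checkpoint time C is much
    smaller than MTBF M; a capped value means the model has broken down
    and checkpoint/restart is making essentially no progress.
    """
    mtbf = system_mtbf_hours(NODE_MTBF_HOURS, n_nodes)
    ckpt = checkpoint_hours(n_nodes, MEM_PER_NODE_GB, IO_BANDWIDTH_GB_PER_S)
    tau = young_interval_hours(ckpt, mtbf)
    waste = ckpt / tau + tau / (2.0 * mtbf)
    return min(1.0, waste)


for n in (1_000, 10_000, 100_000):
    mtbf = system_mtbf_hours(NODE_MTBF_HOURS, n)
    ckpt = checkpoint_hours(n, MEM_PER_NODE_GB, IO_BANDWIDTH_GB_PER_S)
    print(f"{n:>7} nodes: MTBF {mtbf:7.2f} h, "
          f"checkpoint {ckpt:6.3f} h, overhead {overhead_fraction(n):.2f}")
```

Under these assumed numbers, the 1,000-node case loses only a few percent of its time, while at 100,000 nodes the checkpoint takes longer than the system MTBF and the estimate hits its cap. Real systems would scale I/O bandwidth to some degree, so the crossover point here is illustrative rather than predictive.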