General DMA zone rework

Background:

The 16MB Linux ZONE_DMA has some long standing problems on
x86. Traditionally it was designed only for ISA dma which is limited to
24bit (16MB). This means it has a fixed 16MB size.

On 32bit i386 with its limited virtual memory space the next zone is
lowmem with ~900MB (on default split) which works for a lot of devices,
but not all.  But on x86-64 the next zone is only 4GB (DMA32) which is too
big for quite a lot more devices (typically 30,31 or 28bit limitations).

While the DMA zone is in a true VM zone and could be in theory used for
any user allocations in practice the VM has a concept called lower zone
protection to avoid low memory deadlocks that keeps ZONE_DMA nearly
always free unless the caller specifies GFP_DMA directly. This means
in practice it does not participate in the rest of the automatic VM
balancing and its memory is essentially reserved.

Then there is another pool used on x86-64: the swiotlb pool. It is 64MB
by default and used to bounce buffer in the pci DMA API in the low level
drivers for any devices that have 32bit limitations (very common)

Swiotlb and the DMA zone already interact. For consistent mappings swiotlb
will already allocate from both the DMA zone and from the swiotlb pool as
needed. On the other hand swiotlb is a truly separate pool not directly
visible to the normal zone balancing (although it happens to be in the
DMA32 zone). In practice ZONE_DMA behaves very similar in that respect.

Driver interfaces:

When drivers need DMA able memory they typically use the pci_*/dma_*
interfaces which allow specifying device masks.   There are two interfaces
here:

dma_alloc_coherent/pci_alloc_consistent to get a block of coherent
memory honouring a device DMA mask and mapped into an IOMMU as needed.
And pci_map_*/dma_map_* to remap an arbitary block to the DMA mask of
the device and into the IOMMU.

Both ways have their own disadvantages: coherent mappings can have some
bad performance penalties and high setup costs on some platforms which
are not full IO coherent, so they are not encouraged for high volume
driver data.  And pci/dma_map_* will always bounce buffer on the common
x86 swiotlb case so it might be quite expensive. Also on a lot of IOMMU
implementations (in particularly x86 swiotlb/pci-gart) pci/dma_map_*
does not support remapping to any DMA masks smaller than 32bit so it
cannot actually be used for ISA or any other device with <32bit DMA mask.

Then there is the old style way of directly telling the allocators
what memory you need by using GFP_DMA or GFP_DMA32 and then later using
bounce less pci/dma_map_*. That also has its own set of problems: first
GFP_DMA varies between architectures. On x86 it is always 16MB, on IA64
4GB, on some other architectures it doesn't exist at all or has other
sizes. GFP_DMA32 often doesn't exist at all (although it can be often
replaced with GFP_KERNEL on 32bit platforms).  This means any caller
needs to have knowledge about its platform which is often non portable.

Then the other problem is that it these are only single bits into small
fixed zones. So for example if a user has a 30bit DMA zone limit on 64bit
they have no other choice than to use GFP_DMA and when they need more than
16MB of memory they lose.  On the other hand on a lot of other boxes which
don't have any devices with <4GB dma masks ZONE_DMA is just wasted memory.

Then GFP_DMA is also not a very clean interface. It is usually
not documented what device mask is really needed. Also some driver
authors misunderstand it as meaning "required for any DMA" which is
not correct. And often it actually requires dma masks larger than 24bit
(16MB) so the fixed 24bit on x86 is limiting.

The pci_alloc_consistent implementation on x86 also has more problems:
usually it cannot use an IOMMU to remap the dma memory so they actually
have to allocate memory with physical addresses according to the dma mask
of the passed device. All they can do for this is to map it to GFP_DMA
(16MB small), GFP_DMA32 (big, but sometimes too big) or by getting
it from the swiotlb pool. It also attempts to get fitting memory from
the main allocator. That works mostly, but has bad corner cases and is
quite inelegant.

In practice this leads to various unfortunate situations: either the 64bit
system has upto 100MB of reserved memory wasted (ZONE_DMA + swiotlb),
but does not have any devices that require bouncing to 16MB or 4GB.

Or the system has devices that need bouncing to <4GB, but the pools in
their default size are too small and can overflow. There are various
hac^wworkarounds in drivers for this problem, but it still causes
problems for users.  The ZONE_DMA can also not be enlarged because a 
lot of drivers "know" that it is only 16MB and expect 16MB memory from it.

On 32bit x86 the problem is a little less severe because of the 900MB
ZONE_NORMAL which fits most devices, but there are still some problems
with more obscure devices with sufficiently small DMA masks. And ISA
devices still fit in badly.

Requirements for a solution:

There is clearly a need for a new better low memory bounce pool on x86.
It must be larger than 16MB and actually be variable sized. Also the
driver interfaces are inadequate. All DMA memory allocation should
specify what mask they actually need. That allows to extend the pool
and use a single pool for multiple masks.

The new pool must be isolated from the rest of the VM. Otherwise it
cannot be safely used in any device driver paths who cannot necessarily
safely allocate memory (e.g. the block write out path is not allowed to
do this to avoid deadlocks while swapping) The current effective pools
(ZONE_DMA, swiotlb) are already isolated in practice so this won't make
much difference.

Proposed solution:

I chose to implement a new "maskable memory" allocator to solve these
problems. The existing page buddy allocator is not really suited for
this because the data structures don't allow cheap allocation by physical 
address boundary. 

The allocator has a separate pool of memory that it grabs at boot
using the bootmem allocator. The memory is part of the lowest zone,
but practically invisible to the normal page allocator or VM.

The allocator is very simple: it works with pages and uses a bitmap to 
find memory. It also uses a simple rotating cursor through the bitmap.
It is very similar to the allocators used by the various IOMMU 
implementations.  While this is not a true O(1) allocator, in practice
it tends to find free pages very quickly and it is quite flexible.
Also it makes it very simple to allocate below arbitary address boundaries.
It has one advantage over buddy in that it doesn't require all
blocks to be size of power of two. It only rounds to pages. So especially 
larger blocks tend to have less overhead.

The allocator knows how to fall back to the other zones if 
the mask is sufficiently big enough, so it can be used for 
arbitrary masks.

I chose to only implement a page mask allocator only, not "kmalloc_mask",
because the various drivers I looked at actually tended to allocate 
quite large objects towards a page. Also if a sub page allocator 
is really needed there are several existing ones that could be 
relatively easily adopted (mm/dmapool.c or the lib/bitmap.c allocator)
on top of an page allocator.

The maskable allocator's pool is variable sized and the user can set 
it to any size needed (upto 2GB currently). The default sizing 
heuristics are for now the same as in the old code: by default
all free memory below 16MB is put into the pool (in practice that
is only ~8MB or so usable because the kernel is loaded there too) 
and swiotlb is needed another 64MB of low memory are reserved too.
The user can override this using the command line.
Any other subsystems can also increase the memory reservation
(but this currently has to happen early while bootmem is still active)

In the future I hope to make this more flexible. In particular
the low memory could not be fully reserved, but only put into
the "moveable" zone and then later as devices are discovered
and e.g. block devices are set up this pre reservation could
be actually reserved. This would then actually allow to use
a lot of the ~100MB that currently go to waste on x86-64.  But
so far that is not implemented yet.

swiotlb doesn't maintain an own pool anymore, but just allocates
using the mask allocator.  THis is safe because the maskable
pool is isolated from the rest of the VM and not prone
to OOM deadlocks. This is admittedly more a heuristic, than 
a strict 100% guarantee, but it is not worse than the old swiotlb.
Also all users of the maskable allocators currently are benign
and won't overflow it in dynamic situations I believe. 

The internal implementations

ZONE_DMA is disabled for x86 with the maskable allocator enabled.

The maskable allocator requires some simple modifications in
the architecture start up code. These are currently only 
done for x86. All other architectures keep the same GFP_DMA
semantics as they had before. Adapting other architectures
wouldn't be very difficult.

The longer term goal is to convert all GFP_DMA allocations
to always specify the correct mask and then eventually remove
GFP_DMA.

Especially I hope kmalloc/kmem_cache_alloc GFP_DMA can be
removed soon. I have some patches to eliminate those users.
Then slab wouldn't need to maintain DMA caches anymore.

This patch kit only contains the core changes for the actual
allocator, swiotlb conversion and pci_alloc_consistent/dma_alloc_coherent()
Between swiotlb and the later changes this already means that
a lot of drivers use it.

The existing GFP_DMA users transparently fall back to
a maskable allocation with 16MB.

I have various other driver conversions in the pipeline, but 
I will post these sepately to not distract too much from
the review of the main code.

-Andi