Hello,
Can you please explain the actual use of the system? I mean, why do you want to extract text from millions of websites? Knowing this may help me think of a solution.
Now, for a low memory footprint we normally use core PHP with no libraries, but since you are scraping millions of sites you will need a library to process the HTML. The best choice in this case is an XML DOM parser.
I don't think you need to visit each website 3 times; that would mean 3 times the memory consumption. What you need is a strong, broad parser, which can be built up using the experience gained from previous extractions.
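To give a rough idea of what I mean, here is a minimal sketch using the DOM extension that ships with PHP. It assumes the page has already been downloaded once into a string, and it parses that string a single time. The function name extractVisibleText is only my placeholder, and the list of skipped tags (script, style, noscript) is an assumption you would extend as you learn from real sites.

```php
<?php
// Hypothetical helper: returns the visible text of an HTML page
// that was fetched once and is passed in as a string.
function extractVisibleText(string $html): string
{
    if (trim($html) === '') {
        return ''; // nothing to parse
    }

    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html, LIBXML_NONET | LIBXML_COMPACT);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every text node that is not inside script, style or noscript.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//text()[not(ancestor::script or ancestor::style or ancestor::noscript)]'
    );

    $parts = [];
    foreach ($nodes as $node) {
        $parts[] = $node->nodeValue;
    }

    // Join the pieces and collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/u', ' ', implode(' ', $parts)));
}
```

Joining text nodes with a space keeps neighbouring blocks such as a heading and a paragraph apart, but it can also split a word that is broken up by inline tags, so treat this as a starting point and not a finished extractor.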
The whole point is that you cannot build a system like this in a day. You have to keep modifying it and making it better day by day, at least when you are talking about millions of websites.
Feel free to contact me anytime.
Thanks
Meeshal k