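Harvesting a List of Sites with Freshly Published Posts via Ping Service Endpoints

adzon · January 10, 2020 22:13

(Note: the endpoints used in the sample code have since been shut down; this post is for reference only.)

A long time ago, I used to collect lists of websites for some special purposes.

I looked into many approaches at the time, and pulling the list of sites with freshly published posts from ping service endpoints turned out to be one of the better ones.

Leave the process running and newly pinged sites keep flowing in, which is very satisfying.

A friend in the VIP group happened to ask about this, so I am sharing the method here.

Here is the sample code: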

<?php

namespace App\Console\Commands;

use App\Models\Sources;
use Illuminate\Console\Command;
use Exception;

class FetchPingXml extends Command
{
    /**
     * The name and signature of the console command.
     *
     * @var string
     */
    protected $signature = 'fetch:ping';

    /**
     * The console command description.
     *
     * @var string
     */
    protected $description = 'Fetch Ping Xml';

    /**
     * Create a new command instance.
     *
     * @return void
     */
    public function __construct()
    {
        parent::__construct();
    }

    /**
     * Execute the console command.
     *
     * @return mixed
     */
    public function handle()
    {
        // Poll each ping feed in an endless loop and store every newly seen domain.
        do{

            try{

                $time = time();
                $this->info("Loops Start: ".date("Y-m-d H:i:s",time()));
                $url = [];
                $seen = [];
                $content = file_get_contents("http://www.weblogs.com/pingservice?action=weblogsping&1=$time");
                $array = self::parse($content);
                foreach ($array['weblog'] as $value)
                {
                    $content = $value['@attributes'];
                    $url = $this->getHostFromUrl($content['url']);
                    if($url && !isset($seen[$url]) && strpos($url,'.')){
                        $this->info($url);
                        Sources::insertIgnore(['domain' => $url, 'cms' => 'unknown']);
                        $seen[$url] = true;
                    }
                }

                $content = file_get_contents("http://rpc.weblogs.com/shortChanges.xml");
                $array = self::parse($content);
                foreach ($array['weblog'] as $value)
                {
                    $content = $value['@attributes'];
                    $url = $this->getHostFromUrl($content['url']);
                    if($url && !isset($seen[$url]) && strpos($url,'.')){
                        $this->info($url);
                        Sources::insertIgnore(['domain' => $url, 'cms' => 'unknown']);
                        $seen[$url] = true;
                    }
                }

                $content = file_get_contents("http://blogsearch.google.com/changes.xml?last=300");

                preg_match_all ('/url="(.*?)"/',$content,$match );

                $array = $match[1];
                foreach($array as $value)
                {
                    $url = $this->getHostFromUrl($value);
                    if($url && !isset($seen[$url]) && strpos($url,'.')){
                        $this->info($url);
                        Sources::insertIgnore(['domain' => $url, 'cms' => 'unknown']);
                        $seen[$url] = true;
                    }
                }
                $this->info("Loops End: ".date("Y-m-d H:i:s",time()));
            }catch (Exception $e)
            {
                // Report the failure and carry on with the next loop.
                $this->warn("Error: ".$e->getMessage());
            }

        }while(1);

    }

    private function getHostFromUrl($url)
    {
        $block = [
            '.xyz','wordpress.com','.tk'
        ];

        $url = strtolower($url);

        foreach ($block as $each)
        {
            if(strpos($url,$each) !== false) return false;
        }

        $parts = parse_url($url);
        if(!isset($parts['host'])) return false;
        preg_match("@([^/.]+)\.([^.]{2,6}(?:\.[^.]{2,3})?)$@", $parts['host'], $host);
        return array_shift($host);
    }

    /**
     * Convert XML to an array.
     *
     * @param string $xml XML string
     *
     * @return array
     */
    public static function parse($xml)
    {
        $data = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA | LIBXML_NOBLANKS);
        if (is_object($data) && get_class($data) === 'SimpleXMLElement') {
            $data = self::arrarval($data);
        }
        return $data;
    }

    /**
     * Convert an object to an array.
     *
     * @param mixed $data Data to convert
     *
     * @return array
     */

    private static function arrarval($data)
    {
        if (is_object($data) && get_class($data) === 'SimpleXMLElement') {
            $data = (array) $data;
        }
        if (is_array($data)) {
            foreach ($data as $index => $value) {
                $data[$index] = self::arrarval($value);
            }
        }
        return $data;
    }

}
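
To make it easier to see what getHostFromUrl keeps from each pinged URL, here is a minimal standalone sketch of the same blocklist and regex that can be run outside Laravel. The function name extractDomain is hypothetical, and unlike the original it returns null rather than false for rejected URLs.

```php
<?php

/**
 * Hypothetical standalone version of getHostFromUrl() for experimenting
 * outside Laravel. Returns null when the URL is blocked or has no usable host.
 */
function extractDomain(string $url, array $block = ['.xyz', 'wordpress.com', '.tk']): ?string
{
    $url = strtolower($url);

    // Same substring blocklist as the command above.
    foreach ($block as $each) {
        if (strpos($url, $each) !== false) {
            return null;
        }
    }

    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null;
    }

    // Same regex as the command: keep the trailing "name.tld" or "name.co.uk" part.
    if (!preg_match('@([^/.]+)\.([^.]{2,6}(?:\.[^.]{2,3})?)$@', $host, $m)) {
        return null;
    }

    return $m[0];
}

echo extractDomain('http://www.example.com/post/1'), PHP_EOL; // example.com
echo extractDomain('http://www.bbc.co.uk/news'), PHP_EOL;     // bbc.co.uk
echo extractDomain('http://blog.site.com/'), PHP_EOL;         // blog.site.com
var_dump(extractDomain('http://foo.wordpress.com/'));         // NULL (blocked)
```

The third call shows that the regex is only a heuristic: a short label such as "site" is treated like the "co" in "co.uk", so the subdomain is kept.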

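Note: when I tested this back then, I considered running multiple processes, figuring that with so many websites worldwide there must be constant updates. It turned out, however, that the site lists fetched by multiple processes overlapped very heavily, so test it yourself before relying on it to see whether it is actually worth doing.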
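
One rough way to check that overlap before committing to multiple processes is to compare the domain batches collected by two workers. The sketch below assumes each worker's results are available as a plain array of domains; the helper name overlapRatio and the sample data are made up for illustration.

```php
<?php

/**
 * Hypothetical helper: the share of unique domains in $b that already appear in $a.
 */
function overlapRatio(array $a, array $b): float
{
    $b = array_unique($b);
    if (count($b) === 0) {
        return 0.0;
    }

    // Domains from $b that worker A has already collected.
    $shared = array_intersect($b, $a);

    return count($shared) / count($b);
}

// Made-up sample batches standing in for what two workers collected.
$workerA = ['example.com', 'foo.org', 'bar.net', 'baz.io'];
$workerB = ['example.com', 'foo.org', 'bar.net', 'qux.dev'];

printf("%.0f%% of worker B's domains were already seen by worker A\n", overlapRatio($workerA, $workerB) * 100);
```

If the ratio stays close to 100% across real batches, the extra processes are mostly fetching the same list and add little.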

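12115655 · January 11, 2020 02:26

This is useful. I'll test it when I have time.

809140297 · January 11, 2020 02:51

This looks really useful!! I'm a beginner and don't know how to use it, though. Such a pity...

tes4298 · January 11, 2020 14:01

I tried it, but these URLs don't return any data. Am I doing something wrong?

adzon · January 11, 2020 19:56

I wrote this code in 2013 and used it until 2017. Someone asked about it, so I posted it without re-verifying it.

I just checked and found that all 3 endpoints are now closed to the public. I'll look for newer information first; if they have been shut down for good, I will mark this post as outdated.

ooppxxx · January 17, 2020 03:50

Even if it's outdated, it's still a nice idea. Thanks for sharing, L!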